GLASSFISH-13212

cluster nodes receive inconsistent notifications 'failure' and 'joined and ready'

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 3.1
    • Fix Version/s: not determined
    • Labels:
      None
    • Environment:

      Operating System: Linux
      Platform: Linux

    • Issuezilla Id:
      13212

      Description

      build: ogs-3.1-web-b18-08_30_2010

      • start DAS
      • wait for DAS to start
      • start cluster (9 CORE instances on 9 machines)
      • wait for all cluster instances to start
      • kill n1c1m4
      • wait 20 seconds
      • restart n1c1m4
      • wait 5 seconds
      • stop cluster
      • wait for all cluster CORE nodes to stop
      • stop DAS
      • wait for DAS to stop
      • collect logs
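      The steps above could be scripted roughly as follows. This is a hypothetical
      automation sketch, not part of the original test harness: the cluster name "c1",
      the assumption that asadmin is on the PATH, and the hard-coded waits are
      placeholders; only the instance name n1c1m4 is taken from the report.

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;
      import java.util.concurrent.TimeUnit;

      // Hypothetical driver for the scenario above; "c1" is a placeholder cluster name.
      public class Scenario {

          static void asadmin(String... args) throws Exception {
              List<String> cmd = new ArrayList<String>();
              cmd.add("asadmin");                       // assumed to be on the PATH
              cmd.addAll(Arrays.asList(args));
              Process p = new ProcessBuilder(cmd).inheritIO().start();
              if (p.waitFor() != 0) {
                  throw new IllegalStateException("asadmin " + args[0] + " failed");
              }
          }

          public static void main(String[] args) throws Exception {
              asadmin("start-domain");                  // start DAS; returns once it is up
              asadmin("start-cluster", "c1");           // 9 CORE instances on 9 machines
              // kill n1c1m4 at the OS level (e.g. kill -9 on its PID), not via asadmin
              TimeUnit.SECONDS.sleep(20);
              asadmin("start-instance", "n1c1m4");      // restart the killed instance
              TimeUnit.SECONDS.sleep(5);
              asadmin("stop-cluster", "c1");            // stop all cluster CORE nodes
              asadmin("stop-domain");                   // stop DAS
              // collect server.log from the DAS and each instance afterwards
          }
      }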

      bug:
      Node 9 got a 'joined and ready' notification, whereas the others got failure notifications.

      Expected:
      'joined and ready' and 'failure' notifications should be mutually exclusive for
      the nodes of a cluster.
      Otherwise, nodes receiving different notifications could take different business
      actions and drive the system into an inconsistent state.
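
      For context, here is a minimal sketch of a member reacting to these two GMS
      notifications. The class and method names below are recalled from the Shoal GMS
      client API as shipped around this release and should be treated as assumptions,
      and the member and group names are placeholders. It illustrates why members that
      see different notifications for the same event execute different branches and diverge:

      import java.util.Properties;
      import java.util.logging.Logger;

      import com.sun.enterprise.ee.cms.core.CallBack;
      import com.sun.enterprise.ee.cms.core.FailureNotificationSignal;
      import com.sun.enterprise.ee.cms.core.GMSFactory;
      import com.sun.enterprise.ee.cms.core.GroupManagementService;
      import com.sun.enterprise.ee.cms.core.JoinedAndReadyNotificationSignal;
      import com.sun.enterprise.ee.cms.core.Signal;
      import com.sun.enterprise.ee.cms.impl.client.FailureNotificationActionFactoryImpl;
      import com.sun.enterprise.ee.cms.impl.client.JoinedAndReadyNotificationActionFactoryImpl;

      public class NotificationListener implements CallBack {

          private static final Logger LOG = Logger.getLogger("ShoalLogger");

          public void processNotification(Signal signal) {
              String member = signal.getMemberToken();
              if (signal instanceof FailureNotificationSignal) {
                  // e.g. fail over the member's workload
                  LOG.warning("FAILURE for " + member);
              } else if (signal instanceof JoinedAndReadyNotificationSignal) {
                  // e.g. rebalance work back onto the member
                  LOG.info("JOINED_AND_READY for " + member);
              }
          }

          public static void main(String[] args) throws Exception {
              GroupManagementService gms = (GroupManagementService) GMSFactory.startGMSModule(
                      "instance1", "cluster1", GroupManagementService.MemberType.CORE, new Properties());
              NotificationListener cb = new NotificationListener();
              gms.addActionFactory(new FailureNotificationActionFactoryImpl(cb));
              gms.addActionFactory(new JoinedAndReadyNotificationActionFactoryImpl(cb));
              gms.join();
          }
      }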

      Despite allowing 15 seconds after starting the cluster and again after stopping it,
      failure and appointed notifications are not seen on some cluster instances.
      logs:
      http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_08_31_10_t_11_15_02/scenario_0003_Tue_Aug_31_11_38_25_PDT_2010.html

        Activity

        Joe Fialli added a comment -

        It is difficult to analyze this issue due to significant time skew between
        instances in the cluster.

        The following information is extracted from the following file:
        http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_08_31_10_t_11_15_02/scenario_0003_Tue_Aug_31_11_38_25_PDT_2010.html

        Here is the REJOIN event from node 9 that was reported.

        [#|2010-08-31T11:41:09.552-0700|WARNING|oracle-glassfish3.1|ShoalLogger|_ThreadID=15;_ThreadName=Thread-1;|Instance
        n1c1m4 was restarted at 11:41:58 AM PDT on Aug 31, 2010.|#]

        The above log event from node 9 is stating that instance "n1c1m4" was restarted
        49 seconds in the future. Current time on node 9 is 11:41:09 and machine
        running instance n1c1m4 was restarted at 11:41:58 in the future.
        There is at least a 49 second skew in clock time between node9 and the machine
        running n1c1m4. While we should be able to handle such a case, these are not
        ideal conditions under which to investigate this issue. Given that the timing in the
        test is at the granularity of 15- and 20-second waits, the time skew should not be so
        large in these initial test runs. (unless we have a test scenario that is
        testing specifically how GMS fares when the clustered instances have significant
        time skew between them.)

        Examining time skew across clustered instances based on a common event.
        The FailureSuspected event is handled in node9 at time 11:40:50 and it
        is received in node8 at time 11:41:17.211, representing a skew of approximately
        27 seconds between these instances. The FailureSuspected event was sent by
        DAS at time 11:41:36.674 and received in node9 at 11:40:50, representing a skew
        of 46 seconds between master and node9.
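
        As a quick sanity check of the figures above, the skew can be recomputed from the
        quoted timestamps with plain java.time arithmetic (the timestamps are copied
        verbatim from the logs cited above and treated as wall-clock times on the same day):

        import java.time.Duration;
        import java.time.LocalTime;

        public class SkewCheck {
            public static void main(String[] args) {
                // Node 9 logged the REJOIN warning at 11:41:09.552 but reported that
                // n1c1m4 restarted at 11:41:58, i.e. in node 9's future.
                Duration rejoin = Duration.between(
                        LocalTime.parse("11:41:09.552"), LocalTime.parse("11:41:58"));
                System.out.println("node9 vs n1c1m4 skew ~ " + rejoin.getSeconds() + "s"); // ~48-49s

                // FailureSuspected: sent by the DAS at 11:41:36.674, handled on node 9 at 11:40:50.
                Duration das = Duration.between(
                        LocalTime.parse("11:40:50"), LocalTime.parse("11:41:36.674"));
                System.out.println("DAS vs node9 skew ~ " + das.getSeconds() + "s");       // ~46s
            }
        }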

        I would like to propose that this test run be considered invalid, since
        significant time skew between the machines in the cluster is not the
        functionality being tested. At this point, there is no way to tell whether the
        time skew impacted the test or not, but it is a variable better eliminated from
        initial test runs.


          People

          • Assignee:
            Joe Fialli
          • Reporter:
            zorro
          • Votes:
            0
          • Watchers:
            0
