It is difficult to analyze this issue due to significant time skew between
instances in the cluster.
The following information was extracted from the following file:
Here is the REJOIN event from node 9 that was reported.
n1c1m4 was restarted at 11:41:58 AM PDT on Aug 31, 2010.|#]
The above log event from node 9 states that instance "n1c1m4" was restarted
49 seconds in the future: the current time on node 9 is 11:41:09, while the
machine running instance n1c1m4 reports the restart at 11:41:58.
There is at least a 49-second clock skew between node9 and the machine
running n1c1m4. While we should be able to handle such a case, these are not
ideal conditions under which to investigate this issue. Given that the test
uses waits of 15 and 20 seconds, the time skew should not be this large in
these initial test runs (unless we have a test scenario that specifically
tests how GMS fares when the clustered instances have significant time skew
between them).
Examining time skew across clustered instances based on a common event.
The FailureSuspected event was handled in node9 at time 11:40:50 and
received in node8 at time 11:41:17.211, representing a skew of approximately
27 seconds between these instances. The same FailureSuspected event was sent by
DAS at time 11:41:36.674 and received in node9 at 11:40:50, representing a skew
of approximately 46 seconds between the master and node9.
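
The skew figures quoted in this comment can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the helper name skew_seconds and the labels are made up for illustration, the timestamps are the ones quoted above, and the calculation assumes all readings fall on the same day and ignores message delivery latency, so the results are approximate.

```python
from datetime import datetime


def skew_seconds(earlier_clock: str, later_clock: str) -> float:
    """Return later_clock minus earlier_clock in seconds.

    Both arguments are same-day wall-clock readings such as "11:41:09"
    or "11:41:17.211" (hypothetical helper, not part of GMS).
    """
    def parse(ts: str) -> datetime:
        fmt = "%H:%M:%S.%f" if "." in ts else "%H:%M:%S"
        return datetime.strptime(ts, fmt)

    return (parse(later_clock) - parse(earlier_clock)).total_seconds()


# Timestamp pairs taken from the log events discussed in this comment.
observations = {
    "node9 vs machine running n1c1m4 (REJOIN)": ("11:41:09", "11:41:58"),
    "node9 vs node8 (FailureSuspected)": ("11:40:50", "11:41:17.211"),
    "node9 vs DAS (FailureSuspected)": ("11:40:50", "11:41:36.674"),
}

for label, (earlier, later) in observations.items():
    print(f"{label}: {skew_seconds(earlier, later):.3f} s")
```

In every pair node9 holds the earliest reading, which suggests that node9's clock is the one running behind the other machines.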
I would like to propose that this test run be considered invalid, because
significant time skew between machines in the cluster is not the functionality
being tested. At this point, there is no way to infer whether the time skew impacted
the test or not, but it is a variable better off being eliminated from initial