GLASSFISH-16769

RichAccess stress test on OEL with JRockit. Many errors in server.log files, starting with: LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1_b06
    • Fix Version/s: 3.1.1_b07
    • Component/s: failover
    • Labels:
      None

      Description

      OEL machines with the JRockit JDK.

      Three machines were used, with one clustered instance per machine. The richAccess stress test was started with 100 virtual users per instance.

      After about 18 hours of running, the following messages appeared in the instances' server.log files:
      =======================================================
      [#|2011-05-27T10:37:45.981-0700|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=26;_ThreadName=Thread-3;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]
      ------------------------------------------------------
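      For reference, this warning is the typical symptom of a request/response call that gives up waiting on a Future. A minimal sketch of that pattern in plain Java (hypothetical names, not Shoal's actual API):
      =======================================================
      import java.util.concurrent.*;

      public class LoadRequestSketch {
          static final ExecutorService pool = Executors.newCachedThreadPool();

          static byte[] loadFromReplica(String sessionKey) throws Exception {
              // The load request goes to a replica and the caller blocks on a Future.
              Future<byte[]> reply = pool.submit(() -> sendLoadRequest(sessionKey));
              try {
                  // If no response arrives within the window, Future.get throws
                  // java.util.concurrent.TimeoutException -- the exception logged above.
                  return reply.get(3, TimeUnit.SECONDS);
              } catch (TimeoutException te) {
                  reply.cancel(true); // give up on the outstanding request
                  throw te;
              }
          }

          // Placeholder for the actual network round trip (hypothetical).
          static byte[] sendLoadRequest(String key) { return null; }
      }
      ------------------------------------------------------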

      After that, many other error messages were seen in the server.log files. The GC logs were OK. I've attached the server.log files (for instance2, the first server.log file was attached).

      1. server.log_in1
        1.91 MB
        easarina
      2. server.log_in2
        1.91 MB
        easarina
      3. server.log_in3
        930 kB
        easarina

        Activity

        Joe Fialli added a comment -

        Note: based on the submitted server logs, the machine running instance1 has a clock skew of 53 minutes behind the other two server instances. The first line in each server.log shows when the application server was started:

        server.log_in1: May 26, 2011 4:04:52 PM com.sun.enterprise.admin.launcher.GFLauncherLogger info
        server.log_in2: May 26, 2011 4:57:39 PM com.sun.enterprise.admin.launcher.GFLauncherLogger info
        server.log_in3: May 26, 2011 4:57:49 PM com.sun.enterprise.admin.launcher.GFLauncherLogger info
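        The 53-minute figure follows directly from those first-line timestamps. A quick check (a sketch, assuming the timestamp format shown above):
        =======================================================
        import java.time.Duration;
        import java.time.LocalDateTime;
        import java.time.format.DateTimeFormatter;
        import java.util.Locale;

        public class SkewCheck {
            public static void main(String[] args) {
                DateTimeFormatter f =
                    DateTimeFormatter.ofPattern("MMM d, yyyy h:mm:ss a", Locale.ENGLISH);
                LocalDateTime in1 = LocalDateTime.parse("May 26, 2011 4:04:52 PM", f);
                LocalDateTime in2 = LocalDateTime.parse("May 26, 2011 4:57:39 PM", f);
                // in2 and in3 agree to within ~10 seconds, so in1 vs. in2 approximates
                // the skew between machine 1 and the other two machines.
                System.out.println(Duration.between(in1, in2)); // PT52M47S, i.e. ~53 minutes
            }
        }
        ------------------------------------------------------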

        The above are the start times of the 3 instances; however, the GMS log in in1 shows all of them running at the same time:
        [#|2011-05-26T16:04:58.610-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: c1 : Members in view for ADD_EVENT(before change analysis) are :
        1: MemberId: in1, MemberType: CORE, Address: 10.133.184.161:9113:228.9.58.215:16561:c1:in1
        2: MemberId: in2, MemberType: CORE, Address: 10.133.184.162:9182:228.9.58.215:16561:c1:in2
        3: MemberId: in3, MemberType: CORE, Address: 10.133.184.163:9178:228.9.58.215:16561:c1:in3
        4: MemberId: server, MemberType: SPECTATOR, Address: 10.133.184.161:9124:228.9.58.215:16561:c1:server

        #]

        The above could only happen if the 3 instances were really started at the same time AND the clock on machine 1 (running in1) differed from the clocks on machines 2 and 3 (running in2 and in3). It is uncertain whether this skew could result in premature expiration of HTTP or EJB session data, since I am unaware of the life span of session data in richaccess. Someone who understands the test better needs to evaluate whether this skew could have caused the load request failure. There is not enough logging info to know whether expiration of session data is what is causing the failed load requests.
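        One way such skew could bite, if session expiry compares a timestamp stamped on one machine against the clock of another (an assumed mechanism for illustration, not confirmed from these logs):
        =======================================================
        public class ExpiryUnderSkew {
            static final long MAX_IDLE_MILLIS = 30 * 60 * 1000; // assume a 30-minute session timeout

            // lastAccessed was stamped on the machine running in1 (53 minutes behind).
            static boolean isExpired(long lastAccessedMillis, long nowMillis) {
                return nowMillis - lastAccessedMillis > MAX_IDLE_MILLIS;
            }

            public static void main(String[] args) {
                long now = System.currentTimeMillis();
                // A session touched "just now" on the skewed machine carries a timestamp
                // that already looks 53 minutes old on machines 2 and 3.
                long stampedOnSkewedMachine = now - 53 * 60 * 1000;
                System.out.println(isExpired(stampedOnSkewedMachine, now)); // true: premature expiry
            }
        }
        ------------------------------------------------------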

        *********

        It is uncertain whether the issue is in GMS or in HA. However, all the log messages involve HA access to EJBs, so for the time being I will assign the component to failover AND make sure that I am on the watch list for this issue. Information needed to evaluate this issue further:

        1. DAS logs. Just need to be certain that GMS has not incorrectly reported any instance as failed. From the submitted logs, there were no GMS notifications that any instance left via FAILURE or PLANNED_SHUTDOWN.

        2. Load balancer logs. Just to see whether the load balancer decided at any point that an instance was non-responsive and unilaterally changed the server to which it forwards HTTP requests for a specific HTTP session.

        ***********

        GMS evaluation:

        Grepped the logs for the current cluster member view that is used by HA. It looked quite stable. Here are the results. All instances were aware of each other, and that state did not change during the run. (Accounting for the clock skew reported above, all instances were aware of each other within a minute of the application server starting, and there were no GMS view changes after the first minute of startup.) This further confirms that the issue is not a GMS issue as far as current instance membership is concerned. It is still uncertain whether GMS message delivery failed, causing the timeout, or whether there is simply no response when a load is requested for an EJB that does not exist; there may be no reply in that case.
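        That last point matters for diagnosis: if the responder stays silent when the requested key is missing, the requester cannot distinguish "not found" from "message lost"; both surface as the same timeout. A minimal sketch of that ambiguity (hypothetical protocol, not Shoal's actual one):
        =======================================================
        import java.util.Map;
        import java.util.concurrent.*;

        public class NoReplyAmbiguity {
            public static void main(String[] args) throws Exception {
                Map<String, byte[]> replicaStore = new ConcurrentHashMap<>();
                BlockingQueue<byte[]> replyChannel = new ArrayBlockingQueue<>(1);

                // Responder side: replies only when the key exists; silent on a miss.
                byte[] value = replicaStore.get("session-42"); // key was never stored
                if (value != null) {
                    replyChannel.put(value);
                } // no "not found" reply is ever sent

                // Requester side: a lost message and a missing key look identical;
                // poll returns null after the timeout either way.
                byte[] reply = replyChannel.poll(3, TimeUnit.SECONDS);
                System.out.println(reply == null ? "timed out (miss or lost?)" : "loaded");
            }
        }
        ------------------------------------------------------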

        dhcp-burlington9-3rd-a-east-10-152-23-224:richaccess jf39279$ grep current: server.log_in1
        [#|2011-05-26T16:06:06.076-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-3;|**VIEW: prevViewId: 0; curViewID: 1; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a379bca [current: in3] [previous: ]|#]
        [#|2011-05-26T16:06:06.173-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-3;|**VIEW: prevViewId: 1; curViewID: 2; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a076943 [current: in1, in3] [previous: in3]|#]
        [#|2011-05-26T16:06:06.536-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=24;_ThreadName=Thread-3;|**VIEW: prevViewId: 2; curViewID: 3; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a2051c9 [current: in1, in2, in3] [previous: ]|#]

        dhcp-burlington9-3rd-a-east-10-152-23-224:richaccess jf39279$ grep current: server.log_in2
        [#|2011-05-26T16:59:02.124-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-3;|**VIEW: prevViewId: 2; curViewID: 3; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a128e95 [current: in1, in2, in3] [previous: ]|#]

        dhcp-burlington9-3rd-a-east-10-152-23-224:richaccess jf39279$ grep current: server.log_in3
        [#|2011-05-26T16:58:51.707-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-3;|**VIEW: prevViewId: 0; curViewID: 1; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a81c36a [current: in3] [previous: ]|#]
        [#|2011-05-26T16:58:51.781-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=19;_ThreadName=Thread-3;|**VIEW: prevViewId: 1; curViewID: 2; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a075f7c [current: in1, in3] [previous: in3]|#]
        [#|2011-05-26T16:58:52.131-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-3;|**VIEW: prevViewId: 2; curViewID: 3; signal: com.sun.enterprise.ee.cms.impl.common.JoinedAndReadyNotificationSignalImpl@1a0b07c5 [current: in1, in2, in3] [previous: ]|#]

        Joe Fialli added a comment -

        Additional info request.

        Has this test passed in the past with such a time skew between machines?
        Has the richaccess test passed when run with a JDK other than JRockit?

        Joe Fialli added a comment -

        Looking for Mahesh's input on load request timeouts and the failure to find EJB sessions.

        easarina added a comment -

        I also ran the richAccess stress test on AIX with the IBM JDK; I did not see such failures for that run.

        Mahesh Kannan added a comment -

        Checked with Elena. She doesn't see this anymore.

        Mahesh Kannan added a comment -

        Marking the fix version. Elena couldn't reproduce this.


          People

          • Assignee:
            Mahesh Kannan
            Reporter:
            easarina
          • Votes:
            0
            Watchers:
            1
