GLASSFISH-16877

[Blocking] Asadmin CLI timeout after long HA test execution

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.1.1_b07
    • Fix Version/s: 3.1.1_b10
    • Component/s: admin
    • Labels:
      None
    • Environment:

      AIX 6, JDK 6 J9 64bits

      Description

      A cluster with 10 instances was set up on 3 AIX machines. After continuous HA test execution for 14 hours, the asadmin CLI hangs. An "asadmin list-instances" was attempted; the command hung and eventually timed out with the following message:
      No response from Domain Admin Server after 600 seconds.
      The command is either taking too long to complete or the server has failed.
      Please see the server log files for command status.
      Command list-instances failed.
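      The 600-second limit above is the asadmin client's read timeout. As a workaround while diagnosing (it does not fix the unresponsive server), the timeout can reportedly be raised via the AS_ADMIN_READTIMEOUT environment variable; whether this build honors it is an assumption:

```shell
# Assumption: the asadmin client honors the AS_ADMIN_READTIMEOUT environment
# variable (value in milliseconds) for its "No response after N seconds" limit.
# Raising it to 20 minutes only buys diagnosis time; the DAS is still hung.
AS_ADMIN_READTIMEOUT=1200000 asadmin list-instances
```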

        Activity

        Tom Mueller added a comment -

        An initial look at this problem revealed that of the 10 instances that were running, only instance106 was unresponsive. The process was up and appeared to accept an HTTP request, but it didn't respond. So the list-instances command waits forever until asadmin times out. The server.log file for the instance showed the following message repeated over and over:

        [#|2011-06-16T13:30:13.912-0700|SEVERE|glassfish3.1|ShoalLogger|_ThreadID=20;_ThreadName=Thread-6;|GMS1074: failure during multicast receive null|#]

        4 of these messages are being output by the server every 2 seconds. There was a server.log file and two rolled-over server.log files, but all three files contained only this message, so there is no way to know what happened to the process before these messages started being generated.

        The process was killed before jstack output could be obtained from it. That output will be needed to diagnose this problem further.

        Assigning back to Ming.
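        For next time: since this environment is JDK 6 J9 on AIX, where jstack may not be shipped, a thread dump can also be triggered with SIGQUIT, which makes the J9 VM write a javacore file. A sketch, with a placeholder PID:

```shell
# Placeholder PID for the unresponsive instance (take it from list-instances)
PID=454746

# HotSpot JDKs ship jstack; capture the dump to a file before killing anything
jstack -l "$PID" > instance106-threads.txt 2>/dev/null

# IBM J9 (the JVM in this environment) has no jstack, but SIGQUIT makes the
# VM write a javacore*.txt thread dump into the process working directory
kill -QUIT "$PID"
```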

        sherryshen added a comment -

        Thanks to Jeseph Fiali for helping us look at the failure.
        The machine aixas7 will be reconfigured for further ha tests.

        Here is a record of the analysis.
        1) The ha tests are executed on a cluster of 10 instances on 3 AIX machines (aixas7, aixas10, aixas12). The tests hang on aixas7 during ha execution on b07 and b08 in the http module.

        2) I reran one http suite (SLSB, the first suite in the http module) and observed random failures. The das server.log shows that instance106
        on aixas7 is on a different sub-net (10.133.168) from the other 2 machines (aixas10, aixas12, 10.133.169).
        http://agni-1.us.oracle.com/asqe-logs/export1/3.1.1/Results/build08/ha/test1/run1/results/http/com.sun.dft.glassfish.http.failover.ejbref.SLSB/EJBRefFOTest4/logs/st-domain/logs/server.log

        [#|2011-06-20T15:50:47.511-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=10;_ThreadName=Thread-8;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for FAILURE_EVENT(before change analysis) are :
        1: MemberId: instance101, MemberType: CORE, Address: 10.133.169.49:9173:228.9.157.39:21530:st-cluster:instance101
        2: MemberId: instance102, MemberType: CORE, Address: 10.133.169.51:9186:228.9.157.39:21530:st-cluster:instance102
        3: MemberId: instance104, MemberType: CORE, Address: 10.133.169.49:9093:228.9.157.39:21530:st-cluster:instance104
        4: MemberId: instance105, MemberType: CORE, Address: 10.133.169.51:9178:228.9.157.39:21530:st-cluster:instance105
        5: MemberId: instance106, MemberType: CORE, Address: 10.133.168.113:9111:228.9.157.39:21530:st-cluster:instance106
        6: MemberId: instance108, MemberType: CORE, Address: 10.133.169.51:9176:228.9.157.39:21530:st-cluster:instance108
        7: MemberId: server, MemberType: SPECTATOR, Address: 10.133.169.49:9161:228.9.157.39:21530:st-cluster:server

        #]
        root@aixas10% asadmin get-health st-cluster
        instance101 rejoined since Mon Jun 20 16:35:33 PDT 2011
        instance102 started since Tue Jun 21 00:13:35 PDT 2011
        instance103 started since Tue Jun 21 00:14:00 PDT 2011
        instance104 started since Tue Jun 21 00:13:30 PDT 2011
        instance105 rejoined since Mon Jun 20 16:38:04 PDT 2011
        instance106 failed since Mon Jun 20 23:09:59 PDT 2011
        instance107 rejoined since Mon Jun 20 16:37:03 PDT 2011
        instance108 started since Tue Jun 21 00:13:25 PDT 2011
        instance109 failed since Tue Jun 21 02:22:24 PDT 2011
        instance110 failed since Tue Jun 21 02:22:24 PDT 2011
        Command get-health executed successfully.
        root@aixas10% date
        Tue Jun 21 06:33:21 PDT 2011
        root@aixas10%
        root@aixas10% asadmin list-instances -l
        NAME         HOST                   PORT   PID     CLUSTER     STATE
        instance101  aixas10.us.oracle.com  24848  524414  st-cluster  running
        instance102  aixas12.us.oracle.com  24848  421924  st-cluster  running
        instance103  aixas7.us.oracle.com   24848  409786  st-cluster  running
        instance104  aixas10.us.oracle.com  24849  352308  st-cluster  running
        instance105  aixas12.us.oracle.com  24849  131188  st-cluster  running
        instance106  aixas7.us.oracle.com   24849  454746  st-cluster  running
        instance107  aixas10.us.oracle.com  24850  450802  st-cluster  running
        instance108  aixas12.us.oracle.com  24850  372896  st-cluster  running
        instance109  aixas7.us.oracle.com   24850  483572  st-cluster  running
        instance110  aixas7.us.oracle.com   24851  368746  st-cluster  running
        Command list-instances executed successfully.
        root@aixas10%
        103, 106, 109, 110 are on aixas7.
        As of this morning, 103 had started; the other 3 had failed.
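        The sub-net mismatch above can be confirmed without GMS: for a /24 network like these, comparing the first three octets of each host's address is enough. A minimal sketch (the same_subnet helper is hypothetical and assumes a 255.255.255.0 mask):

```shell
# Hypothetical helper: true when two IPv4 addresses share a /24 prefix
same_subnet() {
    a=$(echo "$1" | cut -d. -f1-3)
    b=$(echo "$2" | cut -d. -f1-3)
    [ "$a" = "$b" ]
}

# instance106 (aixas7) vs. instance101 (aixas10), addresses from the GMS view
same_subnet 10.133.168.113 10.133.169.49 && echo "same subnet" || echo "different subnet"
# prints "different subnet"
```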

        3) Multicast works between aixas10 and aixas12:

        root@aixas12% asadmin validate-multicast
        Will use port 2048
        Will use address 228.9.3.1
        Will use bind interface null
        Will use wait period 2,000 (in milliseconds)

        Listening for data...
        Sending message with content "aixas12.us.oracle.com" every 2,000 milliseconds
        Received data from aixas12.us.oracle.com (loopback)
        Received data from aixas10.us.oracle.com
        Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
        Command validate-multicast executed successfully.
        root@aixas12%

        root@aixas10% asadmin validate-multicast
        Will use port 2048
        Will use address 228.9.3.1
        Will use bind interface null
        Will use wait period 2,000 (in milliseconds)

        Listening for data...
        Sending message with content "aixas10.us.oracle.com" every 2,000 milliseconds
        Received data from aixas10.us.oracle.com (loopback)
        Received data from aixas12.us.oracle.com
        Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
        Command validate-multicast executed successfully.
        root@aixas10%

        4) Multicast does not work between aixas7 and aixas10:

        root@aixas7% asadmin validate-multicast
        Will use port 2048
        Will use address 228.9.3.1
        Will use bind interface null
        Will use wait period 2,000 (in milliseconds)

        Listening for data...
        Sending message with content "aixas7.us.oracle.com" every 2,000 milliseconds
        Received data from aixas7.us.oracle.com (loopback)

        Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
        Command validate-multicast executed successfully.
        root@aixas7%

        root@aixas10% asadmin validate-multicast
        Will use port 2048
        Will use address 228.9.3.1
        Will use bind interface null
        Will use wait period 2,000 (in milliseconds)

        Listening for data...
        Sending message with content "aixas10.us.oracle.com" every 2,000 milliseconds
        Received data from aixas10.us.oracle.com (loopback)
        Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
        Command validate-multicast executed successfully.
        root@aixas10%
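        When validate-multicast shows one-way silence like this, rerunning with a longer window and an explicit bind interface helps rule out interface selection on multi-homed hosts. The --timeout option is confirmed by the output above; the other option names are assumptions about this 3.1 build:

```shell
# Run on both aixas7 and aixas10 at the same time. The bind address below is
# aixas7's instance106 address from the GMS view; adjust per host.
# --bindaddress/--multicastaddress/--multicastport are assumed option names.
asadmin validate-multicast --timeout 60 --bindaddress 10.133.168.113 \
    --multicastaddress 228.9.3.1 --multicastport 2048
```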

        sherryshen added a comment -

        Update on machine re-configuration and further tests with hanging.

        1) The machine aixas7 was connected to the same switch as
        the other two machines (aixas10, aixas12).
        The re-IP of aixas7 was not done yet.
        After this, validate-multicast on aixas7 works with the others.

        2) The ha lb tests are hanging at the das with b08.
        Varun took a thread dump and sent it to Tom.

        3) The re-IP of aixas7 is done, so all 3 machines are in the 10.133.169 sub-net.

        4) The ha http tests are hanging at the das with b08.

        5) For the hanging in 4), Tom took a thread dump of the DAS running on aixas10 and
        put it in /tmp/dashang.out. There seems to be an unusual number of active threads in this DAS.

        6) Joe identified two problems and filed bugs:

        http://java.net/jira/browse/GLASSFISH-16909
        start-cluster/stop-cluster leaks threads

        http://java.net/jira/browse/GLASSFISH-16910
        create-instance leaks one thread

        Joe had a fix for the first bug (16909) and supposed that
        it could cause the hang if we are running out of resources.
        He thinks a number of the threads in Tom's thread dump resulted from
        this bug.

        The second bug is less severe – Tom's thread dump had significantly
        fewer of these.
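        To catch leaks like these while tests run, the DAS thread count can be sampled periodically; on AIX, ps -m -o THREAD lists one row per thread. A sketch with a placeholder PID:

```shell
# Placeholder DAS PID; sample the thread count every 60 s to spot steady growth
DAS_PID=123456
while sleep 60; do
    # AIX: "ps -m -o THREAD" prints one line per kernel thread (plus headers),
    # so the count is approximate but the trend is what matters
    n=$(ps -m -o THREAD -p "$DAS_PID" | wc -l)
    echo "$(date) threads=$n"
done
```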

        scatari added a comment -

        Can we close this as a dup of http://java.net/jira/browse/GLASSFISH-16909?

        mzh777 added a comment -

        This issue is probably due to the fact that aixas7 was not on the same subnet as aixas10 and aixas12. aixas7 has since been moved to the same subnet. You can close the bug as fixed, but not as a dup of 16909. We will have a chance to verify the fix on the next HA execution.


          People

          • Assignee:
            mzh777
          • Reporter:
            mzh777