GLASSFISH-17116

list-instances lets asadmin timeout when an instance is hung

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.1.2_b15, 4.0_b15
    • Component/s: admin
    • Labels:
      None

      Description

      When an instance is hung, then the list-instances command hangs too until asadmin finally times out after 600 seconds.

      The reason for this is that list-instances (via InstanceState) uses InstanceCommandExecutor to run the __locations command on the instance. This class runs the command without any timeout.

      To fix this bug, the connection to the instance should time out after some reasonable interval (less than the time asadmin is waiting).

      This issue is being raised due to hang problems that have been experienced with AIX testing. With these hangs, it is possible to initiate a TCP connection to the process, but the connection attempt just hangs; it isn't processed and it isn't refused. To simulate this, set a breakpoint in the __locations command of the instance and see what list-instances does.

      The desirable output from list-instances in this situation is that the state of the instance would be reported as "non-responsive" or "hung".
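The behavior described above (bound the wait on each instance, and report a distinct state when the bound is exceeded) can be sketched in plain Java. This is only an illustration under stated assumptions: the class InstanceProbe, the method probe, and the status strings are hypothetical and are not GlassFish code, and the remote __locations call is stood in for by an ordinary Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InstanceProbe {
    // Hypothetical helper: runs the remote call on a worker thread and waits
    // at most timeoutInMsec milliseconds for its result.
    static String probe(Callable<String> remoteCall, long timeoutInMsec) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutInMsec, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The instance accepted the request but never answered.
            future.cancel(true);
            return "not responding";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "unknown";
        } catch (ExecutionException e) {
            return "failed";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A healthy instance answers immediately.
        System.out.println(probe(() -> "running", 1_000));
        // A hung instance is simulated by a call that sleeps past the timeout.
        System.out.println(probe(() -> { Thread.sleep(10_000); return "running"; }, 200));
    }
}
```

With a bound like this, the caller gets an answer for the hung instance after the chosen interval instead of waiting for the asadmin client itself to give up.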

          Activity

          Tom Mueller created issue -
          sherryshen added a comment - - edited

          I raised an asadmin CLI question based on the tests in
          http://java.net/jira/browse/GLASSFISH-16960
          When one instance is hanging and another instance is killed or stopped in
          "asadmin list-instances" gives a timeout message,
          "asadmin get-health st-cluster" reports 2 instances in failed status.
          Is there any way to report the hanging status to help the user understand the problem?

          Thanks to Tom for filing the bug,
          http://java.net/jira/browse/GLASSFISH-17116
          Its fix will help users understand the status of instances.

          sherryshen added a comment -

          With the same hanging instance101 on aixas10,
          GlassFish treats the hanging instance as a running instance
          in start-instance.
          Can start-instance report a hanging instance differently
          from a normal running instance?

          bash-3.2# asadmin get-health st-cluster
          instance101 failed since Mon Jul 25 17:38:44 PDT 2011
          instance102 started since Tue Jul 26 17:14:02 PDT 2011
          instance103 started since Tue Jul 26 17:14:19 PDT 2011
          instance104 started since Wed Jul 27 06:54:17 PDT 2011
          instance105 started since Tue Jul 26 17:14:03 PDT 2011
          instance106 started since Tue Jul 26 17:14:19 PDT 2011
          instance107 started since Thu Jul 28 10:34:59 PDT 2011
          instance108 started since Tue Jul 26 17:14:09 PDT 2011
          instance109 started since Tue Jul 26 17:14:17 PDT 2011
          instance110 started since Tue Jul 26 17:14:20 PDT 2011
          Command get-health executed successfully.
          bash-3.2# asadmin start-instance instance101
          Instance instance101 is already running.
          Command start-instance executed successfully.
          bash-3.2# asadmin list-instances instance101
          No response from Domain Admin Server after 600 seconds.
          The command is either taking too long to complete or the server has failed.
          Please see the server log files for command status.
          Command list-instances failed.
          bash-3.2# date
          Thu Jul 28 17:50:26 PDT 2011
          bash-3.2#

          scatari added a comment -

          Please evaluate this for possible inclusion in 3.1.2.

          scatari made changes -
          Field Original Value New Value
          Tags 3_1_2-review
          Byron Nevins added a comment -

          This is in Vijay's code.

          Is there someone that took ownership of Vijay's code?

          Byron Nevins added a comment -

          Very interesting. 2 hours to find this because who would expect it. What's wrong with this line of code?

          InstanceCommandResult r = future.get(timeoutInMsec, TimeUnit.SECONDS);

          It would time out. In 2000 seconds. Which is more than 30 minutes.

          You can even test this w/o the fix like so:

          list-instances --timeoutmsec 3 (for a 3-second timeout)
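The arithmetic behind that line can be checked directly with TimeUnit. A minimal sketch, assuming the value passed in is 2000 (the figure quoted in this comment); the class and method names are illustrative and not part of GlassFish.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutUnitCheck {
    // How long Future.get(value, unit) is allowed to block, in milliseconds.
    static long effectiveTimeoutMillis(long value, TimeUnit unit) {
        return unit.toMillis(value);
    }

    public static void main(String[] args) {
        long timeoutInMsec = 2000; // named as milliseconds by the caller

        long asSeconds = effectiveTimeoutMillis(timeoutInMsec, TimeUnit.SECONDS);
        long asMillis = effectiveTimeoutMillis(timeoutInMsec, TimeUnit.MILLISECONDS);

        // 2000 seconds is 2,000,000 ms, i.e. 33 whole minutes.
        System.out.println("as SECONDS:      " + asSeconds + " ms ("
                + TimeUnit.MILLISECONDS.toMinutes(asSeconds) + " min)");
        // 2000 milliseconds is 2 seconds.
        System.out.println("as MILLISECONDS: " + asMillis + " ms");
    }
}
```

Because the variable name says milliseconds while the unit argument says seconds, the wait is a thousand times longer than the name suggests, which is why passing 3 gives a 3-second timeout.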

          Byron Nevins added a comment - - edited
          • What is the impact on the customer of the bug?
            Apparently a big deal for customers using AIX. Also annoying if any instance is in the Zombie state.

            How likely is it that a customer will see the bug and how serious is the bug?
            Not likely unless they have a Zombie instance. Apparently common on AIX.

            Is it a regression? Does it meet other bug fix criteria (security, performance, etc.)?
            Yes. I wrote code that had a 2000 msec timeout. Someone changed that to 2000 seconds.

          • What is the cost/risk of fixing the bug?
            As close to zero as one can get.

            How risky is the fix? How much work is the fix? Is the fix complicated?
            Very little work. Very, very simple. No risk.

          • Is there an impact on documentation or message strings?
            No.
          • Which tests should QA (re)run to verify the fix did not destabilize GlassFish?
            General tests that list instances. QA has no tests that exercise this: you need to be able to "hang" an instance to see the bug. For us developers it's trivial: add a Thread.sleep() in __locations based on whether an env. variable is set.

          • Which is the targeted build of 3.1.2 for this fix?
            B15
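The Thread.sleep() trick mentioned in the QA answer above could look roughly like the following. This is a sketch only: the environment variable name SIMULATE_HANG_MSEC and the HangSimulator class are hypothetical, and the real __locations command body is not shown.

```java
import java.util.Map;

public class HangSimulator {
    // Hypothetical variable name; GlassFish does not define it.
    static final String ENV_NAME = "SIMULATE_HANG_MSEC";

    // Returns the requested delay in milliseconds, or 0 when the variable
    // is unset, not a number, or negative.
    static long delayMillis(Map<String, String> env) {
        String raw = env.get(ENV_NAME);
        if (raw == null) {
            return 0;
        }
        try {
            return Math.max(0, Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Intended to be called at the top of the command under test so that
    // the instance appears hung to the caller for the configured time.
    static void maybeHang() throws InterruptedException {
        long delay = delayMillis(System.getenv());
        if (delay > 0) {
            Thread.sleep(delay);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        maybeHang();
        System.out.println("delayed for about " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Gating the sleep on an environment variable keeps the delay out of normal runs, so the same build can be used with and without the simulated hang.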
          Byron Nevins added a comment -

          Since it is so trivial, and instructive, here's the fix:

          BEFORE:
          InstanceCommandResult r = future.get(timeoutInMsec, TimeUnit.SECONDS);

          AFTER:
          InstanceCommandResult r = future.get(timeoutInMsec, TimeUnit.MILLISECONDS);

          Byron Nevins made changes -
          Fix Version/s 4.0_b15 [ 14801 ]
          Fix Version/s 3.1.2_b14 [ 15328 ]
          Byron Nevins added a comment -

          Here is the checkin to 4.0

          Waiting for approval for 3.1.2

          d:\gf\branches\3.1.2\cluster>svn commit D:/gf/trunk/main/nucleus/cluster/common/src/main/java/com/sun/enterprise/util/cluste
          Sending D:\gf\trunk\main\nucleus\cluster\common\src\main\java\com\sun\enterprise\util\cluster\InstanceInfo.java
          Transmitting file data .
          Committed revision 51569.

          Joe Di Pol made changes -
          Tags 3_1_2-review 3_1_2-approved
          Joe Di Pol made changes -
          Fix Version/s 3.1.2_b15 [ 15329 ]
          Fix Version/s 3.1.2_b14 [ 15328 ]
          Byron Nevins added a comment -

          Checked into 3.1.2 branch:

          d:\gf\branches\3.1.2\cluster>svn commit common\src\main\java\com\sun\enterprise\util\cluster\InstanceInfo.java
          Sending common\src\main\java\com\sun\enterprise\util\cluster\InstanceInfo.java
          Transmitting file data .
          Committed revision 51596.

          Byron Nevins made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Byron Nevins added a comment -

          Changing the timeout from 2 seconds to 60 seconds as requested by Tom Mueller.

          Code Review: Tom

          Byron Nevins made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Byron Nevins added a comment -

          Now it is a 60 second timeout.
          Note that if there is a Zombie server – you'll have to wait the full 60 seconds for the command to complete.

          Sending D:\gf\branches\3.1.2\cluster\admin\src\main\java\com\sun\enterprise\v3\admin\cluster\ListInstancesCommand.java
          Sending D:\gf\trunk\main\nucleus\cluster\admin\src\main\java\com\sun\enterprise\v3\admin\cluster\ListInstancesCommand.java
          Transmitting file data ..
          Committed revision 51791.

          Byron Nevins made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Byron Nevins made changes -
          Link This issue is duplicated by GLASSFISH-18066 [ GLASSFISH-18066 ]
          Byron Nevins made changes -
          Link This issue blocks GLASSFISH-18091 [ GLASSFISH-18091 ]
          Byron Nevins made changes -
          Link This issue blocks GLASSFISH-18091 [ GLASSFISH-18091 ]
          Byron Nevins made changes -
          Link This issue is related to GLASSFISH-18091 [ GLASSFISH-18091 ]
          Byron Nevins made changes -
          Link This issue is related to GLASSFISH-20110 [ GLASSFISH-20110 ]

            People

            • Assignee:
              Byron Nevins
              Reporter:
              Tom Mueller
            • Votes:
              0
              Watchers:
              1