[GLASSFISH-17116] list-instances lets asadmin timeout when an instance is hung Created: 27/Jul/11  Updated: 17/Apr/13  Resolved: 27/Dec/11

Status: Resolved
Project: glassfish
Component/s: admin
Affects Version/s: 3.1.1
Fix Version/s: 3.1.2_b15, 4.0_b15

Type: Bug Priority: Major
Reporter: Tom Mueller Assignee: Byron Nevins
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by GLASSFISH-18066 Instance state not updated after startup Closed
Related
is related to GLASSFISH-20110 RARE Restart Server|Domain client-sid... Resolved
is related to GLASSFISH-18091 We Need One And Only One Way To Deter... Resolved
Tags: 3_1_2-approved

 Description   

When an instance is hung, the list-instances command hangs too, until asadmin finally times out after 600 seconds.

The reason for this is that list-instances (via InstanceState) uses InstanceCommandExecutor to run the __locations command on the instance. This class runs the command without any timeout.

To fix this bug, the connection to the instance should time out after some reasonable interval (less than the time asadmin is waiting).

This issue is being raised due to hang problems that have been experienced with AIX testing. With these hangs, it is possible to initiate a TCP connection to the process, but the connection attempt just hangs; it isn't processed and it isn't refused. To simulate this, set a breakpoint in the __locations command of the instance and see what list-instances does.

The desirable output from list-instances in this situation is that the state of the instance would be reported as "non-responsive" or "hung".
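
The behavior requested above can be sketched roughly as follows. This is only an illustration, not GlassFish code: InstanceProbe, State and probe are hypothetical names, and the Callable stands in for whatever InstanceCommandExecutor does to run __locations. The point is that the wait on the remote call is bounded, and a timeout is mapped to its own state rather than to a hang of the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical names throughout; this is a sketch, not the GlassFish implementation.
final class InstanceProbe {
    enum State { RUNNING, NOT_RUNNING, NOT_RESPONDING }

    // Runs the remote call on a worker thread and waits at most timeoutMsec for its answer.
    static State probe(Callable<Boolean> remoteCall, long timeoutMsec) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> future = pool.submit(remoteCall);
            try {
                return future.get(timeoutMsec, TimeUnit.MILLISECONDS)
                        ? State.RUNNING : State.NOT_RUNNING;
            } catch (TimeoutException e) {
                future.cancel(true);          // interrupt the stuck call
                return State.NOT_RESPONDING;  // accepted the work but never answered
            } catch (ExecutionException e) {
                return State.NOT_RUNNING;     // the call itself failed, e.g. connection refused
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return State.NOT_RESPONDING;
            }
        } finally {
            pool.shutdownNow();               // do not leave the worker thread behind
        }
    }

    public static void main(String[] args) {
        Callable<Boolean> healthy = () -> true;
        // Simulated hung instance: the call is accepted but does not return in time.
        Callable<Boolean> hung = () -> { Thread.sleep(60_000); return true; };
        System.out.println(probe(healthy, 1000));
        System.out.println(probe(hung, 200));
    }
}
```

With a bound like this, the caller always gets an answer within the chosen interval, which only helps if that interval is shorter than the 600 seconds asadmin itself waits.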



 Comments   
Comment by sherryshen [ 27/Jul/11 ]

I raised an asadmin CLI question based on the tests in
http://java.net/jira/browse/GLASSFISH-16960
When one instance is hanging and another instance is killed or stopped,
"asadmin list-instances" gives a timeout message and
"asadmin get-health st-cluster" shows 2 instances in failed status.
Is there any way to report the hanging status to help the user understand the problem?

Thanks to Tom for filing the bug,
http://java.net/jira/browse/GLASSFISH-17116
Its fix will help users understand the status of instances.

Comment by sherryshen [ 29/Jul/11 ]

With the same hanging instance101 on aixas10,
glassfish treats the hanging instance as a running instance
in start-instance.
Can start-instance give a different message for a hanging
instance than for a normal instance?

bash-3.2# asadmin get-health st-cluster
instance101 failed since Mon Jul 25 17:38:44 PDT 2011
instance102 started since Tue Jul 26 17:14:02 PDT 2011
instance103 started since Tue Jul 26 17:14:19 PDT 2011
instance104 started since Wed Jul 27 06:54:17 PDT 2011
instance105 started since Tue Jul 26 17:14:03 PDT 2011
instance106 started since Tue Jul 26 17:14:19 PDT 2011
instance107 started since Thu Jul 28 10:34:59 PDT 2011
instance108 started since Tue Jul 26 17:14:09 PDT 2011
instance109 started since Tue Jul 26 17:14:17 PDT 2011
instance110 started since Tue Jul 26 17:14:20 PDT 2011
Command get-health executed successfully.
bash-3.2# asadmin start-instance instance101
Instance instance101 is already running.
Command start-instance executed successfully.
bash-3.2# asadmin list-instances instance101
No response from Domain Admin Server after 600 seconds.
The command is either taking too long to complete or the server has failed.
Please see the server log files for command status.
Command list-instances failed.
bash-3.2# date
Thu Jul 28 17:50:26 PDT 2011
bash-3.2#

Comment by scatari [ 04/Nov/11 ]

Please evaluate this for possible inclusion in 3.1.2.

Comment by Byron Nevins [ 11/Dec/11 ]

This is in Vijay's code.

Is there someone that took ownership of Vijay's code?

Comment by Byron Nevins [ 15/Dec/11 ]

Very interesting. 2 hours to find this because who would expect it. What's wrong with this line of code?

InstanceCommandResult r = future.get(timeoutInMsec, TimeUnit.SECONDS);

It would time out. In 2000 seconds. Which is more than 30 minutes.

You can even test this w/o the fix like so:

list-instances --timeoutmsec 3 (for a 3-second timeout)

Comment by Byron Nevins [ 15/Dec/11 ]
  • What is the impact on the customer of the bug?
    Apparently a big deal for customers using AIX. Also annoying if any instance is in the Zombie state.
  • How likely is it that a customer will see the bug and how serious is the bug?
    Not likely unless they have a Zombie instance. Apparently common on AIX.
  • Is it a regression? Does it meet other bug fix criteria (security, performance, etc.)?
    Yes. I wrote code that had a 2000 msec timeout. Someone changed that to 2000 seconds.
  • What is the cost/risk of fixing the bug?
    As close to zero as one can get.
  • How risky is the fix? How much work is the fix? Is the fix complicated?
    Very little work. Very very simple. No risk.
  • Is there an impact on documentation or message strings?
    No.
  • Which tests should QA (re)run to verify the fix did not destabilize GlassFish?
    General tests that list instances. They have no tests to exercise this. You need to be able to "hang" an instance to see the bug. For us developers it's trivial: add a Thread.sleep() in __locations based on whether an env. variable is set.
  • Which is the targeted build of 3.1.2 for this fix?
    B15
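
The Thread.sleep() trick mentioned in the testing answer could look something like the sketch below. Everything in it is an assumption made for illustration: HangSimulator, maybeHang and the SIMULATE_HANG_MSEC variable are hypothetical names, and the real __locations command is not shown. The idea is only that a developer build calls maybeHang() at the top of the command, so the instance appears hung when the variable is set and behaves normally otherwise.

```java
// Hypothetical helper for developer testing only; not part of GlassFish.
final class HangSimulator {
    // Assumed variable name, chosen for this sketch.
    static final String ENV_VAR = "SIMULATE_HANG_MSEC";

    // Parses the configured delay; returns 0 when the value is unset, invalid, or negative.
    static long parseDelayMsec(String raw) {
        if (raw == null) {
            return 0;
        }
        try {
            return Math.max(0, Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Call at the start of the command under test; sleeps only if the variable is set.
    static void maybeHang() {
        long delay = parseDelayMsec(System.getenv(ENV_VAR));
        if (delay > 0) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        maybeHang();
        long waitedMsec = (System.nanoTime() - start) / 1_000_000;
        System.out.println("waited " + waitedMsec + " msec");
    }
}
```

Setting the delay well above the list-instances timeout makes the instance look like a Zombie for the duration of the test.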
Comment by Byron Nevins [ 15/Dec/11 ]

Since it is so trivial, and instructive, here's the fix:

BEFORE:
InstanceCommandResult r = future.get(timeoutInMsec, TimeUnit.SECONDS);

AFTER:
InstanceCommandResult r = future.get(timeoutInMsec, TimeUnit.MILLISECONDS);
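
To make the size of the mistake concrete, here is a small standalone sketch (TimeoutUnits and effectiveMillis are hypothetical names, not GlassFish code) that converts the same numeric value under both units. The value 2000 is the one quoted in the comments above.

```java
import java.util.concurrent.TimeUnit;

// Illustration only: shows how long a wait of `value` really is under each unit.
final class TimeoutUnits {
    // Effective wait in milliseconds when `value` is interpreted in `unit`.
    static long effectiveMillis(long value, TimeUnit unit) {
        return unit.toMillis(value);
    }

    public static void main(String[] args) {
        long timeoutInMsec = 2000; // intended as 2000 milliseconds

        long buggy = effectiveMillis(timeoutInMsec, TimeUnit.SECONDS);
        long fixed = effectiveMillis(timeoutInMsec, TimeUnit.MILLISECONDS);

        // prints "SECONDS: 2000000 ms (33 min)"
        System.out.println("SECONDS: " + buggy + " ms ("
                + TimeUnit.MILLISECONDS.toMinutes(buggy) + " min)");
        // prints "MILLISECONDS: 2000 ms"
        System.out.println("MILLISECONDS: " + fixed + " ms");
    }
}
```

Because 2000 seconds is longer than the 600 seconds asadmin waits, the inner timeout could never fire first, which matches the symptom in the description.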

Comment by Byron Nevins [ 15/Dec/11 ]

Here is the checkin to 4.0

Waiting for approval for 3.1.2

d:\gf\branches\3.1.2\cluster>svn commit D:/gf/trunk/main/nucleus/cluster/common/src/main/java/com/sun/enterprise/util/cluste
Sending D:\gf\trunk\main\nucleus\cluster\common\src\main\java\com\sun\enterprise\util\cluster\InstanceInfo.java
Transmitting file data .
Committed revision 51569.

Comment by Byron Nevins [ 15/Dec/11 ]

Checked into 3.1.2 branch:

d:\gf\branches\3.1.2\cluster>svn commit common\src\main\java\com\sun\enterprise\util\cluster\InstanceInfo.java
Sending common\src\main\java\com\sun\enterprise\util\cluster\InstanceInfo.java
Transmitting file data .
Committed revision 51596.

Comment by Byron Nevins [ 27/Dec/11 ]

Changing the timeout from 2 seconds to 60 seconds as requested by Tom Mueller.

Code Review: Tom

Comment by Byron Nevins [ 27/Dec/11 ]

Now it is a 60 second timeout.
Note that if there is a Zombie server – you'll have to wait the full 60 seconds for the command to complete.

Sending D:\gf\branches\3.1.2\cluster\admin\src\main\java\com\sun\enterprise\v3\admin\cluster\ListInstancesCommand.java
Sending D:\gf\trunk\main\nucleus\cluster\admin\src\main\java\com\sun\enterprise\v3\admin\cluster\ListInstancesCommand.java
Transmitting file data ..
Committed revision 51791.

Generated at Thu Mar 05 16:59:48 UTC 2015 using JIRA 6.2.3#6260-sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.