I have a distributed cluster (1 DAS and 3 instances), all on separate machines.
I deploy an EJB web app to the DAS and the cluster. I then send POST messages to the
DAS and each instance, which causes all the instances to start sending
messages to each other. After 15 seconds I issue an asadmin
stop-local-instance to the second instance. The command returns the following:
Waiting for the instance to stop
Timed out (60 seconds) waiting for the domain to stop.
Command stop-local-instance failed.
I then issue asadmin list-instances clustername and get back:
ins02 not running
I then issue asadmin get-health clustername and get back:
ins01 started since Thu Sep 23 11:29:58 PDT 2010
ins02 started since Thu Sep 23 11:29:58 PDT 2010
ins03 started since Thu Sep 23 11:29:57 PDT 2010
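The disagreement between the two commands above (list-instances says ins02 is not running, get-health says it is started) can be checked mechanically. The following is a minimal sketch, not part of the original test: the helper name find_status_mismatch is hypothetical, and it assumes the two outputs have the line format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: print every instance that list-instances reports as
# "not running" while get-health still reports it as "started".
#   $1 = captured output of: asadmin list-instances <cluster>
#   $2 = captured output of: asadmin get-health <cluster>
# Assumes the line formats shown in this report.
find_status_mismatch() {
    list_out=$1
    health_out=$2
    # Pick the instance names whose line reads "<name> not running".
    printf '%s\n' "$list_out" |
    awk '$2 == "not" && $3 == "running" { print $1 }' |
    while read -r name; do
        # awk exits 0 only if get-health has a "<name> started ..." line.
        if printf '%s\n' "$health_out" |
           awk -v n="$name" '$1 == n && $2 == "started" { found = 1 } END { exit !found }'
        then
            echo "$name"
        fi
    done
}

# Possible use against a live cluster:
#   find_status_mismatch "$(asadmin list-instances clustername)" \
#                        "$(asadmin get-health clustername)"
```

With the outputs quoted above, this would print ins02, the instance for which the two views disagree.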
After this I try to connect to the HTTP port of ins02 (using a browser) and get:
Unable to connect.
Throughout this situation I used a script to monitor the pids of all the java
processes on the ins02 machine. I saw the process for stop-local-instance
start and end, but the appserver process never terminated.
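For anyone trying to reproduce this, a pid monitor along those lines could look like the sketch below. This is not the script used in the test; the function names are hypothetical, and it assumes pgrep is available on the instance machine.

```shell
#!/bin/sh
# Hypothetical sketch of a pid monitor; not the script used in this report.
# Assumes pgrep is installed.

# snapshot_pids PATTERN
# Print one timestamped line listing the pids whose command line matches
# PATTERN (the list is empty if nothing matches).
snapshot_pids() {
    pids=$(pgrep -f "$1" | tr '\n' ' ')
    echo "$(date '+%H:%M:%S') pids: $pids"
}

# monitor_pids PATTERN INTERVAL COUNT
# Take COUNT snapshots, INTERVAL seconds apart.
monitor_pids() {
    i=0
    while [ "$i" -lt "$3" ]; do
        snapshot_pids "$1"
        i=$((i + 1))
        sleep "$2"
    done
}

# Possible use on the ins02 machine, one snapshot per second for two minutes:
#   monitor_pids java 1 120
```

A short-lived pid that appears and disappears in the log would correspond to the stop-local-instance client, while a pid that is present in every snapshot would be the appserver that never exits.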
Looking in the DAS server log, I was able to determine that the test running
on ins02 finished its work and sent a message to the DAS indicating such.
From all this information it appears that the appserver (ins02) was partially
shut down by stop-local-instance, but that the GMS process and the EJB
container were not stopped and continued to run, leaving the appserver in a
Debugging this issue is currently blocked as a result of the limited information
contained in the server log. See dependency bug