There are numerous intermittent failures in various Hudson jobs, all indicating that the server is shutting down for no reason apparent in the log file. This needs to be investigated ASAP. The latest such failure is seen in .
This issue might be related to GLASSFISH-20133 (these might be duplicates).
Byron, please evaluate whether this is a duplicate of GLASSFISH-20133.
The root problem is known and scheduled for fixing in 4.0.1
It's unlikely to bite a user.
The problem is that we use 2 different ways to verify whether a server is officially RUNNING or STOPPED.
That causes tight timing problems when Hudson is hammering commands one after the next.
Duplicate of GLASSFISH-20133.
There are two reasons for reopening this bug, viz:
a) This bug was filed earlier than GLASSFISH-20133, so if these two are indeed the same issue, close GLASSFISH-20133 as the duplicate.
b) I don't exactly understand why this is not a priority. We have seen this bug outside the Hudson environment as well, and on multiple platforms. E.g., I have seen it in my Linux development environment, and Bill Shannon has seen it in his Solaris environment. In both cases the bug was seen while running the QuickLook test suite.
What is it blocking? Lowering to P2...
This issue is bereft of detail.
E.g. I was told separately that QL fails on both Linux and Solaris:
I don't have any feel for your QL failures. Every time? Intermittent? Occasional? Rare?
I dropped everything in order to work on this as of right now...
Changing the synopsis back to describe the problem more accurately.
copy of my email to Shreedhar:
Here is what PROBABLY happens:
We use several different ways to see if a server is running.
1. Look for a pid file if we're running locally
2. run a remote command like "uptime". If we get a response, server is running
3. I forget – there might be a third.
You can see we can get timing problems. The pid file may get generated a millisecond BEFORE Grizzly is all set up. Or vice-versa.
The solution is to use the exact same criteria to establish RUNNING versus STOPPED for all commands.
Which is quite difficult and risky to get just right.
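The race described above can be sketched as follows. This is an illustrative simulation only (the file name, port, and method names are invented, not GlassFish's actual implementation): two independent probes can disagree during the startup window, because the pid file appears before Grizzly finishes binding the admin port.

```java
import java.io.IOException;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: two independent "is the server running?" probes
// that can disagree during the startup window.
public class RunningCheckRace {

    // Probe 1: the local pid file, written early in server startup.
    static boolean pidFileSaysRunning(Path pidFile) {
        return Files.exists(pidFile);
    }

    // Probe 2: the admin listener, which only accepts connections
    // once the HTTP layer has finished binding the port.
    static boolean portSaysRunning(String host, int port) {
        try (Socket s = new Socket(host, port)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated startup window: the pid file already exists,
        // but nothing is listening on the admin port yet.
        Path pidFile = Files.createTempFile("domain1", ".pid");
        boolean byPid = pidFileSaysRunning(pidFile);
        boolean byPort = portSaysRunning("localhost", 4848);
        System.out.println("pid-file check: " + byPid + ", port check: " + byPort);
        Files.deleteIfExists(pidFile);
    }
}
```

Two commands issued back-to-back by Hudson can hit this window and reach opposite conclusions about the server's state.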
I've run QL a dozen times in a row. No failures.
I'm going to have to get invasive (in the source code) to force this error...
To be fixed for 4.0.1
Priority remains at P2
I don't understand the timing aspect. What we see from the logs (still available in the referenced Hudson job) is that the server shuts down automatically. Can you explain when that can happen? I can't recall coming across this kind of situation before, so surely something in 4.0 is triggering this behavior.
I feel we should not defer this bug without understanding the root cause.
I don't see any evidence of the server shutting down "automatically".
Where? Please paste it in here...
Here are the CLI commands from that run:
98 03/29/2013 21:34:46 EXIT: 0 asadmin start-domain domain1
99 03/29/2013 21:34:52 EXIT: 0 asadmin create-jvm-options -Djava.security.manager
100 03/29/2013 21:34:54 EXIT: 0 asadmin stop-domain
101 03/29/2013 21:35:02 EXIT: 0 asadmin start-domain domain1
102 03/29/2013 21:45:14 EXIT: 1 nadmin --host localhost --port 4848 --secure=false --terse=false --echo=false --interactive=false start-domain --verbose=false --watchd og=false --debug=false --domaindir /scratch/java_re/BUILD_AREA/workspace/gf-trunk-build-continuous/gfv3-wp/glassfish4/glassfish/domains domain1
103 03/29/2013 21:45:23 EXIT: 1 asadmin restart-domain domain1
104 03/29/2013 21:50:31 EXIT: 0 asadmin deploy --upload ../../dist/basicweb/hellojsp.war
which was triggered by these commands:
Let me try to answer why I said the server was "shutting down automatically." I was definitely influenced by the following comment made by Tom on 21 March, over email, while responding to some Hudson job failure messages:
We've been seeing occasional failures of other test suites (GF admin devtests, nucleus admin devtests) in which there seems to be this behavior too, where it looks like the server just dies.
I will let Tom explain what made him come to this conclusion.
When I saw QL fail in my local environment once (unfortunately I don't have the logs available to look at now - ask Bill if he has them), it had to do with the cluster test. The nature of the failure led me to a similar conclusion as Tom's.
You may be spot on with your analysis that the server is not really crashing or automatically shutting down; rather, the asadmin client is failing to correctly detect that it has started. If that's the case, it is definitely not as severe a problem as we initially thought, and you can update the synopsis and priority. If we don't want to fix it now, I am hoping we can fix our tests to be more resilient.
Finally, do we know why we have started to see this behavior (failing to detect that the server has started correctly) only recently? I can't recall having seen such an issue before. What change has triggered it?
It's quite rare. But definitely there. The last time the continuous build had it was 45 runs ago:
So the first big job is to make it reproducible
This issue is causing multiple failures in the web devtests. The web devtests perform many start/stop operations for functionality tests that require a server restart, and failing to restart is affecting test results.
[echo] run.xml:stopping domain...
[exec] CLI306: Warning - The server located at /Users/aroh/glassfish/v3-8-2/all/main/appserver/distributions/glassfish/target/glassfish4/glassfish/domains/domain1 is not running.
[exec] Command stop-domain executed successfully.
[echo] Starting DAS
[exec] There is a process already using the admin port 4848 – it probably is another instance of a GlassFish server.
[exec] Command start-domain failed.
[exec] Result: 1
[echo] run.xml:doing wait for localhost:8080...
[echo] run.xml:DONE doing wait for localhost:8080...
In order to reproduce, run "ant all" from $APS_HOME/devtests/web.
This hypothesis was super-easy to test. It is not the problem:
"server may be starting too fast to detect"
There is no known way to reproduce the problem which makes it difficult to prove that it is fixed.
I've fixed it.
We were using several different techniques simultaneously to see if the server was restarted:
(1) compare timestamp of the local password file
(2) compare the "uptime" of the old server versus the uptime of the new server
I.e. we were using quite indirect methods.
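Technique (2) above can be sketched as follows. This is a hypothetical illustration (the method name is invented): a freshly restarted JVM reports a smaller uptime than the old one, but the comparison is timing-sensitive, because a probe that races the restart can be answered by the old server and draw the wrong conclusion.

```java
// Hypothetical sketch of the old, indirect restart check: compare the
// server's reported uptime before and after the restart request.
public class UptimeHeuristic {

    // A restarted JVM reports a smaller uptime than the old one did.
    static boolean looksRestarted(long uptimeBeforeMs, long uptimeAfterMs) {
        return uptimeAfterMs < uptimeBeforeMs; // indirect, racy signal
    }

    public static void main(String[] args) {
        // Probe answered by the new server: small uptime => "restarted".
        System.out.println(looksRestarted(600_000, 3_000));
        // Probe raced the restart and the OLD server answered with a larger
        // uptime: the check wrongly concludes no restart happened.
        System.out.println(looksRestarted(600_000, 601_000));
    }
}
```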
I changed it to use something that is guaranteed to work. Here is the new algorithm for restart-domain (local)
1. get the PID of the running server (might be "-1" if the server isn't running)
2. tell the server to restart itself
3. In a loop – try fetching the PID. Once it changes – we are done, the server has restarted.
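The three steps above can be sketched as a simple loop. This is a minimal simulation of the idea, not the real asadmin code: the pid probe here is fed canned values, whereas the real client fetches the PID from the server, sleeps between attempts, and enforces an overall timeout.

```java
import java.util.Iterator;
import java.util.List;

// Minimal sketch of the PID-change detection loop described above.
public class RestartDetector {

    // Waits until the reported PID differs from oldPid. A restarted JVM
    // necessarily runs as a different process, so a changed PID is a
    // direct, unambiguous signal that the restart completed.
    static long waitForNewPid(long oldPid, Iterator<Long> pidProbe) {
        while (pidProbe.hasNext()) {
            long pid = pidProbe.next();
            if (pid > 0 && pid != oldPid) {
                return pid; // server came back as a different process
            }
            // real code would sleep briefly here and enforce a timeout
        }
        throw new IllegalStateException("timed out waiting for restart");
    }

    public static void main(String[] args) {
        // Simulated probe results: old PID still visible, then -1 while
        // the server is down, then the new PID once it is back up.
        Iterator<Long> probe = List.of(1234L, 1234L, -1L, -1L, 5678L).iterator();
        System.out.println("restarted as pid " + waitForNewPid(1234L, probe));
    }
}
```

Unlike the old timestamp/uptime heuristics, this check cannot produce a false positive: the PID either changed or it didn't.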
I've submitted the changes to Tom for review...
In summary, when this intermittent bug appears the client running restart-domain will not recognize that the server (domain or instance) restarted correctly. After 10 minutes it times out. The server is running. It's a false negative.
The impact on the customer is annoyance. The major impact is in automated tests.
It is unlikely that the customer will bump into this issue. He would have to do a start/stop/restart in fast succession.
It is not a serious bug because it is just the client reporting incorrect results. The server-side is working fine.
OTOH - it depends on one's opinion of what's serious. It gives the user a bad impression of the product if he sees it.
The cost to fix it is minimal, in fact it is already fixed and waiting to go in. If it doesn't go into 4.0 it'll go into 4.0.1
The bug appears rarely in Quick Look tests. Currently in about 1 out of 50 Hudson runs of the continuous dev builds.
Not only is the fix not complicated - it is the other way 'round. The existing code is overly complex. The fixed code is much much simpler.
There is little risk. Automated tests, including QuickLook, exercise this area all the time.
No doc impact.
QA need only run their usual standard tests start/stop/restart
This is a core lifecycle fix.
Approved for 4.0. I've updated the fix version.
Transmitting file data ...
Committed revision 61575.
Does not handle the case where restart-domain is run by asadmin (local command) but does NOT use the local password for authentication.
Namely restart-domain -P 4848 -H whatever
Oops. reopened wrong issue. I should have reopened the issue this is a duplicate of!
Sorry I got confused with a raft of similar issues.
Transmitting file data ...
Committed revision 61693.