glassfish
  1. glassfish
  2. GLASSFISH-20110

RARE Restart Server|Domain client-side reporting failures - RACE condition

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0_b82_EE7MS7
    • Fix Version/s: 4.0_b86_RC2
    • Component/s: admin
    • Labels:
      None

      Description

      There are numerous intermittent failures seen in various hudson jobs and all indicate that server is shutting down for no apparent reason mentioned in the log file. It needs to be investigated ASAP. The latest such falure is seen in [1].

      [1] http://gf-hudson.us.oracle.com/hudson/job/gf-trunk-build-continuous/13961

        Issue Links

          Activity

          Hide
          Tom Mueller added a comment -

          This issue might be related to GLASSFISH-20133 (these might be duplicates).

          Show
          Tom Mueller added a comment - This issue might be related to GLASSFISH-20133 (these might be duplicates).
          Hide
          Tom Mueller added a comment -

          Byron, please evaluate whether this is a duplicate of GLASSFISH-20133.

          Show
          Tom Mueller added a comment - Byron, please evaluate whether this is a duplicate of GLASSFISH-20133 .
          Hide
          Byron Nevins added a comment -

          The root problem is known and scheduled for fixing in 4.0.1

          It's unlikely to bite a user.

          zThe problem is that we use 2 different ways to verify if a server is officially RUNNING or STOPPED.
          That causes tight timing problems when Hudson is hammering commands one after the next.

          Show
          Byron Nevins added a comment - The root problem is known and scheduled for fixing in 4.0.1 It's unlikely to bite a user. zThe problem is that we use 2 different ways to verify if a server is officially RUNNING or STOPPED. That causes tight timing problems when Hudson is hammering commands one after the next.
          Hide
          Byron Nevins added a comment -

          duplicate of 20133

          Show
          Byron Nevins added a comment - duplicate of 20133
          Hide
          Sanjeeb Sahoo added a comment -

          There are two reasons for reopening this bug, viz:

          a) This bug was filed earlier than GLASSFISH-20133, so if at all these two are same issues, close GLASSFISH-20133 as a duplicate.

          b) I don't exactly understand why this is not a priority. We have seen this bug outside hudson environment as well and on multiple platforms. e.g., I have seen it my linux development environment. Bill Shannon has seen this in his Solaris environment. In both cases, the bug was seen while running QuickLook test suite.

          Show
          Sanjeeb Sahoo added a comment - There are two reasons for reopening this bug, viz: a) This bug was filed earlier than GLASSFISH-20133 , so if at all these two are same issues, close GLASSFISH-20133 as a duplicate. b) I don't exactly understand why this is not a priority. We have seen this bug outside hudson environment as well and on multiple platforms. e.g., I have seen it my linux development environment. Bill Shannon has seen this in his Solaris environment. In both cases, the bug was seen while running QuickLook test suite.
          Hide
          Byron Nevins added a comment -

          What is it blocking? Lowering to P2...

          Show
          Byron Nevins added a comment - What is it blocking? Lowering to P2...
          Hide
          Byron Nevins added a comment -

          This issue is bereft of detail.

          E.g. I was told separately that QL fails on both Linux and Solaris:

          I don't have any feel for your QL failures. Everytime? Intermittent? Occasional? Rare?

          Show
          Byron Nevins added a comment - This issue is bereft of detail. E.g. I was told separately that QL fails on both Linux and Solaris: I don't have any feel for your QL failures. Everytime? Intermittent? Occasional? Rare?
          Hide
          Byron Nevins added a comment -

          I dropped everything in order to work on this as of right now...

          Show
          Byron Nevins added a comment - I dropped everything in order to work on this as of right now...
          Hide
          Sanjeeb Sahoo added a comment -

          Changing the synopsys back to describe the problem more accurately.

          Show
          Sanjeeb Sahoo added a comment - Changing the synopsys back to describe the problem more accurately.
          Hide
          Byron Nevins added a comment -

          copy of my email to Shreedhar:

          =================

          Here is what PROBABLY happens:

          We use several different ways to see if a server is running.
          1. Look for a pid file if we're running locally
          2. run a remote command like "uptime". If we get a response, server is running
          3. I forget – there might be a third.

          You can see we can get timing problems. The pid file may get generated a millisecond BEFORE Grizzly is all set up. Or vice-versa.

          The solution is to use the exact same criteria to establish RUNNING versus STOPPED for all commands.

          Which is quite difficult and risky to get just right.

          Show
          Byron Nevins added a comment - copy of my email to Shreedhar: ================= Here is what PROBABLY happens: We use several different ways to see if a server is running. 1. Look for a pid file if we're running locally 2. run a remote command like "uptime". If we get a response, server is running 3. I forget – there might be a third. You can see we can get timing problems. The pid file may get generated a millisecond BEFORE Grizzly is all set up. Or vice-versa. The solution is to use the exact same criteria to establish RUNNING versus STOPPED for all commands. Which is quite difficult and risky to get just right.
          Hide
          Byron Nevins added a comment -

          I've run QL a dozen times in a row. No failures.

          I'm going to have to get invasive (in the source code) to force this error...

          Show
          Byron Nevins added a comment - I've run QL a dozen times in a row. No failures. I'm going to have to get invasive (in the source code) to force this error...
          Hide
          Byron Nevins added a comment -

          To be fixed for 4.0.1
          Priority remains at P2

          Show
          Byron Nevins added a comment - To be fixed for 4.0.1 Priority remains at P2
          Hide
          Sanjeeb Sahoo added a comment -

          I don't understand the timing aspect. What we see from logs (which is still available in the referenced hudson job) is that server shutsdown automatically. Can you exqlain when that can haqqen (read q as the char before it as my key board is malfucntioning)? I can't recollect to have come across this kind of situation before. So, surely something in 4.0 is triggering this behavior.

          Show
          Sanjeeb Sahoo added a comment - I don't understand the timing aspect. What we see from logs (which is still available in the referenced hudson job) is that server shutsdown automatically. Can you exqlain when that can haqqen (read q as the char before it as my key board is malfucntioning)? I can't recollect to have come across this kind of situation before. So, surely something in 4.0 is triggering this behavior.
          Hide
          Sanjeeb Sahoo added a comment -

          I feel we should not defer this bug without understanding the root cause.

          Show
          Sanjeeb Sahoo added a comment - I feel we should not defer this bug without understanding the root cause.
          Hide
          Byron Nevins added a comment -

          I don't see any evidence of the server shutting down "automatically".
          Where? Please paste it in here...

          Show
          Byron Nevins added a comment - I don't see any evidence of the server shutting down "automatically". Where? Please paste it in here...
          Hide
          Byron Nevins added a comment -

          Here are the CLI commands from that run:

          98 03/29/2013 21:34:46 EXIT: 0 asadmin start-domain domain1
          99 03/29/2013 21:34:52 EXIT: 0 asadmin create-jvm-options -Djava.security.manager
          100 03/29/2013 21:34:54 EXIT: 0 asadmin stop-domain
          101 03/29/2013 21:35:02 EXIT: 0 asadmin start-domain domain1
          102 03/29/2013 21:45:14 EXIT: 1 nadmin --host localhost --port 4848 --secure=false --terse=false --echo=false --interactive=false start-domain --verbose=false --watchd og=false --debug=false --domaindir /scratch/java_re/BUILD_AREA/workspace/gf-trunk-build-continuous/gfv3-wp/glassfish4/glassfish/domains domain1
          103 03/29/2013 21:45:23 EXIT: 1 asadmin restart-domain domain1
          04 03/29/2013 21:50:31 EXIT: 0 asadmin deploy --upload ../../dist/basicweb/hellojsp.war
          ===============

          which was triggered by these commands:
          1. start-domain
          2. create-jvm-options
          3. stop-domain
          4. start-domain
          5. restart-domain

            • fails. Waits 10 minutes because it *INCORRECTLY* thinks the server did not restart. But it *did* restart based on the next cli commands
              6. deploy – which can't possibly succeed if the server is not running.
          Show
          Byron Nevins added a comment - Here are the CLI commands from that run: 98 03/29/2013 21:34:46 EXIT: 0 asadmin start-domain domain1 99 03/29/2013 21:34:52 EXIT: 0 asadmin create-jvm-options -Djava.security.manager 100 03/29/2013 21:34:54 EXIT: 0 asadmin stop-domain 101 03/29/2013 21:35:02 EXIT: 0 asadmin start-domain domain1 102 03/29/2013 21:45:14 EXIT: 1 nadmin --host localhost --port 4848 --secure=false --terse=false --echo=false --interactive=false start-domain --verbose=false --watchd og=false --debug=false --domaindir /scratch/java_re/BUILD_AREA/workspace/gf-trunk-build-continuous/gfv3-wp/glassfish4/glassfish/domains domain1 103 03/29/2013 21:45:23 EXIT: 1 asadmin restart-domain domain1 04 03/29/2013 21:50:31 EXIT: 0 asadmin deploy --upload ../../dist/basicweb/hellojsp.war =============== which was triggered by these commands: 1. start-domain 2. create-jvm-options 3. stop-domain 4. start-domain 5. restart-domain fails. Waits 10 minutes because it * INCORRECTLY * thinks the server did not restart. But it * did * restart based on the next cli commands 6. deploy – which can't possibly succeed if the server is not running.
          Hide
          Sanjeeb Sahoo added a comment -

          Byron,

          Let me try to answer why I said the server was shutting down automatically." I was definitely influenced by the following comment made by Tom on 21 March over email while responding to some hudson job failure messages:

          We've been seeing occasional failures of other test suites (GF admin devtests, nucleus admin devtests) in which there seems to be this behavior too, where it looks like the server just dies.

          I will let Tom explain what made him come to this conclusion.

          When I had seen QL failing in my local environment once (unfortunately I don't have the logs available to look at now - ask Bill if he has it), it was to do with the cluster test. The nature of failure made me come to similar conclusion as Tom.

          You may be spot on with your analysis that server is not really crashing or automatically shutting down. Rather, it's just that asadmin client is failing to correctly detect that it's started. If that's the case, it is definitely as severe a problem as we initially thought and you can update the synopsys and priority. If we don't want to fix it now, I am hoping we can fix our tests to be more resilient.

          Finally, do we know why we have started to see this behavior (failing to detect server has started correctly) only recently? I can't recollect to have seen such issue before. What change has trigged this issue?

          Thanks,
          Sahoo

          Show
          Sanjeeb Sahoo added a comment - Byron, Let me try to answer why I said the server was shutting down automatically." I was definitely influenced by the following comment made by Tom on 21 March over email while responding to some hudson job failure messages: We've been seeing occasional failures of other test suites (GF admin devtests, nucleus admin devtests) in which there seems to be this behavior too, where it looks like the server just dies. I will let Tom explain what made him come to this conclusion. When I had seen QL failing in my local environment once (unfortunately I don't have the logs available to look at now - ask Bill if he has it), it was to do with the cluster test. The nature of failure made me come to similar conclusion as Tom. You may be spot on with your analysis that server is not really crashing or automatically shutting down. Rather, it's just that asadmin client is failing to correctly detect that it's started. If that's the case, it is definitely as severe a problem as we initially thought and you can update the synopsys and priority. If we don't want to fix it now, I am hoping we can fix our tests to be more resilient. Finally, do we know why we have started to see this behavior (failing to detect server has started correctly) only recently? I can't recollect to have seen such issue before. What change has trigged this issue? Thanks, Sahoo
          Hide
          Byron Nevins added a comment -

          It's quite rare. But definitely there. The last time the continuous build had it was 45 runs ago:

          http://gf-hudson.us.oracle.com/hudson/job/gf-trunk-build-continuous/14089/

          So the first big job is to make it reproducible

          Show
          Byron Nevins added a comment - It's quite rare. But definitely there. The last time the continuous build had it was 45 runs ago: http://gf-hudson.us.oracle.com/hudson/job/gf-trunk-build-continuous/14089/ So the first big job is to make it reproducible
          Hide
          Amy Roh added a comment -

          This issue is causing multiple failures in web devtests. Web devtests perform many start/stop for functionality testings that require server restart and failing to do so is affecting test results.

          stopDomain:
          [echo] run.xml:stopping domain...
          [exec] CLI306: Warning - The server located at /Users/aroh/glassfish/v3-8-2/all/main/appserver/distributions/glassfish/target/glassfish4/glassfish/domains/domain1 is not running.
          [exec] Command stop-domain executed successfully.

          setOSConditions:

          setToolWin:

          setToolUnix:

          setToolProperty:

          setS1ASclassPath:

          init-common:

          startDomainUnix:
          [echo] Starting DAS
          [exec] There is a process already using the admin port 4848 – it probably is another instance of a GlassFish server.
          [exec] Command start-domain failed.
          [exec] Result: 1
          [echo] run.xml:doing wait for localhost:8080...
          [echo] run.xml:DONE doing wait for localhost:8080...

          In order to reproduce, run "ant all" from $APS_HOME/devtests/web.

          Show
          Amy Roh added a comment - This issue is causing multiple failures in web devtests. Web devtests perform many start/stop for functionality testings that require server restart and failing to do so is affecting test results. stopDomain: [echo] run.xml:stopping domain... [exec] CLI306: Warning - The server located at /Users/aroh/glassfish/v3-8-2/all/main/appserver/distributions/glassfish/target/glassfish4/glassfish/domains/domain1 is not running. [exec] Command stop-domain executed successfully. setOSConditions: setToolWin: setToolUnix: setToolProperty: setS1ASclassPath: init-common: startDomainUnix: [echo] Starting DAS [exec] There is a process already using the admin port 4848 – it probably is another instance of a GlassFish server. [exec] Command start-domain failed. [exec] Result: 1 [echo] run.xml:doing wait for localhost:8080... [echo] run.xml:DONE doing wait for localhost:8080... In order to reproduce, run "ant all" from $APS_HOME/devtests/web.
          Hide
          Byron Nevins added a comment -

          This hypothesis was super-easy to test. It is not the problem:

          "server may be starting too fast to detect"

          Show
          Byron Nevins added a comment - This hypothesis was super-easy to test. It is not the problem: "server may be starting too fast to detect"
          Hide
          Byron Nevins added a comment -

          There is no known way to reproduce the problem which makes it difficult to prove that it is fixed.

          I've fixed it.

          we were using several different techniques simultaneously to see if the server was restarted.

          Namely:
          (1) compare timestamp of the local password file
          (2) compare the "uptime" of the old server versus the uptime of the new server

          I.e. we were using quite indirect methods.

          I changed it to use something that is guaranteed to work. Here is the new algorithm for restart-domain (local)

          1. get the PID of the running server (might be "-1" if the server isn't running)
          2. tell the server to restart itself
          3. In a loop – try fetching the PID. Once it changes – we are done, the server has restarted.

          I've submitted the changes to Tom for review...

          Show
          Byron Nevins added a comment - There is no known way to reproduce the problem which makes it difficult to prove that it is fixed. I've fixed it. we were using several different techniques simultaneously to see if the server was restarted. Namely: (1) compare timestamp of the local password file (2) compare the "uptime" of the old server versus the uptime of the new server I.e. we were using quite indirect methods. I changed it to use something that is guaranteed to work. Here is the new algorithm for restart-domain (local) 1. get the PID of the running server (might be "-1" if the server isn't running) 2. tell the server to restart itself 3. In a loop – try fetching the PID. Once it changes – we are done, the server has restarted. I've submitted the changes to Tom for review...
          Hide
          Byron Nevins added a comment -

          In summary, when this intermittent bug appears the client running restart-domain will not recognize that the server (domain or instance) restarted correctly. After 10 minutes it times out. The server is running. It's a false negative.

          The impact on the customer is annoyance. The major impact is in automated tests.
          It is unlikely that the customer will bump into this issue. He would have to do a start/stop/restart in fast succession.
          It is not a serious bug because it is just the client reporting incorrect results. The server-side is working fine.
          OTOH - it depends on one's opinion of what's serious. It gives the user a bad impression of the product if he sees it.
          The cost to fix it is minimal, in fact it is already fixed and waiting to go in. If it doesn't go into 4.0 it'll go into 4.0.1

          The bug appears rarely in Quick Look tests. Currently in about 1 out of 50 Hudson runs of the continuous dev builds.

          Not only is the fix not complicated - it is the other way 'round. The existing code is overly complex. The fixed code is much much simpler.

          There is little risk. Automated tests, including QuickLook test this area all the time.

          No doc impact.

          QA need only run their usual standard tests start/stop/restart
          This is a core lifecycle fix.

          Show
          Byron Nevins added a comment - In summary, when this intermittent bug appears the client running restart-domain will not recognize that the server (domain or instance) restarted correctly. After 10 minutes it times out. The server is running. It's a false negative. The impact on the customer is annoyance. The major impact is in automated tests. It is unlikely that the customer will bump into this issue. He would have to do a start/stop/restart in fast succession. It is not a serious bug because it is just the client reporting incorrect results. The server-side is working fine. OTOH - it depends on one's opinion of what's serious. It gives the user a bad impression of the product if he sees it. The cost to fix it is minimal, in fact it is already fixed and waiting to go in. If it doesn't go into 4.0 it'll go into 4.0.1 The bug appears rarely in Quick Look tests. Currently in about 1 out of 50 Hudson runs of the continuous dev builds. Not only is the fix not complicated - it is the other way 'round. The existing code is overly complex. The fixed code is much much simpler. There is little risk. Automated tests, including QuickLook test this area all the time. No doc impact. QA need only run their usual standard tests start/stop/restart This is a core lifecycle fix.
          Hide
          Tom Mueller added a comment -

          Approved for 4.0. I've updated the fix version.

          Show
          Tom Mueller added a comment - Approved for 4.0. I've updated the fix version.
          Hide
          Byron Nevins added a comment -

          Sending admin/server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/LocalServerCommand.java
          Sending admin/server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/RestartDomainCommand.java
          Sending cluster/cli/src/main/java/com/sun/enterprise/admin/cli/cluster/RestartLocalInstanceCommand.java
          Transmitting file data ...
          Committed revision 61575.

          Show
          Byron Nevins added a comment - Sending admin/server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/LocalServerCommand.java Sending admin/server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/RestartDomainCommand.java Sending cluster/cli/src/main/java/com/sun/enterprise/admin/cli/cluster/RestartLocalInstanceCommand.java Transmitting file data ... Committed revision 61575.
          Hide
          Byron Nevins added a comment -

          Does not handle the case where restart-domain is run by asadmin (local command) but does NOT use the local password for authentication.

          Namely restart-domain -P 4848 -H whatever

          Show
          Byron Nevins added a comment - Does not handle the case where restart-domain is run by asadmin (local command) but does NOT use the local password for authentication. Namely restart-domain -P 4848 -H whatever
          Hide
          Byron Nevins added a comment -

          Oops. reopened wrong issue. I should have reopened the issue this is a duplicate of!

          Show
          Byron Nevins added a comment - Oops. reopened wrong issue. I should have reopened the issue this is a duplicate of!
          Hide
          Byron Nevins added a comment -

          Sorry I got confused with a raft of similar issues.

          Show
          Byron Nevins added a comment - Sorry I got confused with a raft of similar issues.
          Hide
          Byron Nevins added a comment -

          Sending server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/LocalServerCommand.java
          Sending server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/RestartDomainCommand.java
          Sending server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/StopDomainCommand.java
          Transmitting file data ...
          Committed revision 61693.

          Show
          Byron Nevins added a comment - Sending server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/LocalServerCommand.java Sending server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/RestartDomainCommand.java Sending server-mgmt/src/main/java/com/sun/enterprise/admin/servermgmt/cli/StopDomainCommand.java Transmitting file data ... Committed revision 61693.

            People

            • Assignee:
              Byron Nevins
              Reporter:
              Sanjeeb Sahoo
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: