glassfish
  1. glassfish
  2. GLASSFISH-19672

Race Condition when stopping server causes false negative

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0_b87_RC3
    • Component/s: admin
    • Labels:
      None

      Description

      Problem -

      Occasionally a stop-local-instance command will return failure like so:

      ~/dev/bg/main/nucleus/common/glassfish-api> asadmin stop-instance i1
      remote failure: Error trying to stop the instance named i1 : Remote server does not listen for requests on [localhost:24848]. Is the server up?
      Command stop-instance failed.

      even though the server was successfully stopped.

      ======

        Issue Links

          Activity

          Hide
          Byron Nevins added a comment -

          The problem is this:

          1) ReST makes a HTTP connection from DAS to an instance server
          2) The command that it runs, StopInstanceInstanceCommand is asynchronous – the server commits suicide
          3) which, naturally, breaks the HTTP connedfction
          4) The actual lowest level place is (currently) line 1288 of RemoteRestAdminCommand – like so ==>

          RemoteRestAdminCommand
          Line 1288: int code = urlConnection.getResponseCode();

          Set a breakpoint there. On my Mac it will throw a ConnectException

          The server is running BEFORE 1288
          The server is stopped IMMEDIATELY AFTER 1288 (in the catch block)

          ===============
          This is trivially weasy to reproduce in a debugger since time grinds to a halt in a debugger. In DevTests it happens occasionally when a server dies really fast.

          The fix is very very simple – ignore ALL EXCEPTIONS! It doesn't matter! Really! The client code (in this case DAS itself) is going to verify that the instance really died. The return value is garbage from asynchronous commands – they just always return success.

          Hold on there – am I SURE this is 100% safe to do?
          Yes! Because:

          (1) before making the call to server we absolutely positively verify that the server is alive and responsive
          (2) We ask the server to kill itself
          (3) we ignore any exceptions thrown in (2)
          (4) we absolutely positively verify that the server is dead and gone before returning

          Yes - we also handle failure OK.

          Show
          Byron Nevins added a comment - The problem is this: 1) ReST makes a HTTP connection from DAS to an instance server 2) The command that it runs, StopInstanceInstanceCommand is asynchronous – the server commits suicide 3) which, naturally, breaks the HTTP connedfction 4) The actual lowest level place is (currently) line 1288 of RemoteRestAdminCommand – like so ==> RemoteRestAdminCommand Line 1288: int code = urlConnection.getResponseCode(); Set a breakpoint there. On my Mac it will throw a ConnectException The server is running BEFORE 1288 The server is stopped IMMEDIATELY AFTER 1288 (in the catch block) =============== This is trivially weasy to reproduce in a debugger since time grinds to a halt in a debugger. In DevTests it happens occasionally when a server dies really fast. The fix is very very simple – ignore ALL EXCEPTIONS! It doesn't matter! Really! The client code (in this case DAS itself) is going to verify that the instance really died. The return value is garbage from asynchronous commands – they just always return success. Hold on there – am I SURE this is 100% safe to do? Yes! Because: (1) before making the call to server we absolutely positively verify that the server is alive and responsive (2) We ask the server to kill itself (3) we ignore any exceptions thrown in (2) (4) we absolutely positively verify that the server is dead and gone before returning Yes - we also handle failure OK.
          Hide
          Byron Nevins added a comment -

          The fix is so tiny I'm adding it here:

          ndex: src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java
          ===================================================================
          — src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java (revision 61586)
          +++ src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java (working copy)
          @@ -280,11 +280,12 @@
          ParameterMap map = new ParameterMap();
          map.add("force", Boolean.toString(force));
          rac.executeCommand(map);
          + } catch (Exception e)

          { + // The instance server may have died so fast we didn't have time to + // get the (always successful!!) return data. This is NOT AN ERROR! + // see: http://java.net/jira/browse/GLASSFISH-19672 + // also see StopDomainCommand which does the same thing. }
          • catch (CommandException ex) { - return Strings.get("stop.instance.racError", instanceName, - ex.getLocalizedMessage()); - }

          return null;
          }

          Show
          Byron Nevins added a comment - The fix is so tiny I'm adding it here: ndex: src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java =================================================================== — src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java (revision 61586) +++ src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java (working copy) @@ -280,11 +280,12 @@ ParameterMap map = new ParameterMap(); map.add("force", Boolean.toString(force)); rac.executeCommand(map); + } catch (Exception e) { + // The instance server may have died so fast we didn't have time to + // get the (always successful!!) return data. This is NOT AN ERROR! + // see: http://java.net/jira/browse/GLASSFISH-19672 + // also see StopDomainCommand which does the same thing. } catch (CommandException ex) { - return Strings.get("stop.instance.racError", instanceName, - ex.getLocalizedMessage()); - } return null; }
          Hide
          Byron Nevins added a comment -

          In summary, when this intermittent bug appears DAS will not recognize that the instance server it told to stop - really stopped. If it stops VERY fast, then the Rest connection throws a (useless, unimportant) Exception which should be completely ignored.

          The impact on the customer is annoyance. It causes a false negative. The major impact is in automated tests.
          It is quite possible that the customer will bump into this issue.

          It is not a serious bug because the command actually succeeed but was reported as a failure. If the same command is run again – it will report that the cluster or instance is already stopped.

          OTOH - it depends on one's opinion of what's serious. It gives the user a bad impression of the product if he sees it.
          The cost to fix it is minimal, in fact it is already fixed and waiting to go in. If it doesn't go into 4.0 it'll go into 4.0.1

          The bug appears rarely in Quick Look tests. It's dependent on hardware.

          The fix is about as simple as can be – ignore an Exception instead of failing. Changed code is simpler than existing code.

          There is little risk. Automated tests, including QuickLook test this area all the time.

          No doc impact.

          QA need only run their usual standard tests start/stop/restart

          This is a core lifecycle fix.

          I highly recommend allowing this into 4.0
          It is about as risk-free as you can get.

          Show
          Byron Nevins added a comment - In summary, when this intermittent bug appears DAS will not recognize that the instance server it told to stop - really stopped. If it stops VERY fast, then the Rest connection throws a (useless, unimportant) Exception which should be completely ignored. The impact on the customer is annoyance. It causes a false negative. The major impact is in automated tests. It is quite possible that the customer will bump into this issue. It is not a serious bug because the command actually succeeed but was reported as a failure. If the same command is run again – it will report that the cluster or instance is already stopped. OTOH - it depends on one's opinion of what's serious. It gives the user a bad impression of the product if he sees it. The cost to fix it is minimal, in fact it is already fixed and waiting to go in. If it doesn't go into 4.0 it'll go into 4.0.1 The bug appears rarely in Quick Look tests. It's dependent on hardware. The fix is about as simple as can be – ignore an Exception instead of failing. Changed code is simpler than existing code. There is little risk. Automated tests, including QuickLook test this area all the time. No doc impact. QA need only run their usual standard tests start/stop/restart This is a core lifecycle fix. I highly recommend allowing this into 4.0 It is about as risk-free as you can get.
          Hide
          Tom Mueller added a comment -

          Approved for 4.0.

          Show
          Tom Mueller added a comment - Approved for 4.0.
          Hide
          Byron Nevins added a comment -

          Sending admin/src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java
          Transmitting file data .
          Committed revision 61631.

          Show
          Byron Nevins added a comment - Sending admin/src/main/java/com/sun/enterprise/v3/admin/cluster/StopInstanceCommand.java Transmitting file data . Committed revision 61631.

            People

            • Assignee:
              Byron Nevins
              Reporter:
              Byron Nevins
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Due:
                Created:
                Updated:
                Resolved: