sailfin / SAILFIN-1869

Subscribe-refresh: after an instance was killed, a large number of errors were logged and communication stopped.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: milestone 1
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      ********************************************************************************

      • Template v0.1 ( 05/01/08 )
      • Sailfin Stress test issue
      ********************************************************************************
        Sailfin Build : 23
        Cluster size : 10
        Happens in a single instance (y/n) ? : NA
        Test id : st2_4_presence_subscribe-refresh
        Location of the test : as-telco-sqe/stress-ws/presence
        JDK version : 1.6.0_14, 64 bits
        CLB used : Yes
        HW LB used : No.
        SSR: Enabled
        =========================================================

      SuSE machines (asqe-oblade-{1-10}.sfbay.sun.com), one instance per machine.
      Nine sipp clients were running.

      The load was -m 333333 -r 305 per sipp client (9 sipp sessions in total).
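To put the load figures above in aggregate terms, here is a minimal arithmetic sketch. It assumes that sipp's -r is a per-second call rate (the report gives no rate period), that -m is the call limit per client, and that the CLB spreads traffic evenly across the live instances; the function name `per_instance_rate` is hypothetical.

```python
# Figures taken from the report; the interpretation is an assumption.
SIPP_CLIENTS = 9           # number of sipp sessions
CALLS_PER_CLIENT = 333333  # sipp -m: calls to process before exiting
RATE_PER_CLIENT = 305      # sipp -r: call rate, assumed to be per second

total_calls = SIPP_CLIENTS * CALLS_PER_CLIENT
total_rate = SIPP_CLIENTS * RATE_PER_CLIENT


def per_instance_rate(live_instances):
    """Average new calls per second per instance, assuming an even spread."""
    return total_rate / live_instances


if __name__ == "__main__":
    print("total calls:", total_calls)
    print("aggregate rate (calls/s):", total_rate)
    print("per instance, 10 live:", per_instance_rate(10))
    print("per instance, 9 live:", per_instance_rate(9))
```

Under these assumptions, losing one of the ten instances raises the average load on each survivor from 274.5 to 305 new calls per second, before counting any retransmissions caused by the failure.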

      The run was fine for about 4 hours, until one instance (instance4) was
      killed.

      Countless error messages were then written to the server.log files:

      ==============================================================
      SEVERE|sun-glassfish-comms-server2.0|javax.enterprise.system.container.sip|_ThreadID=22;_ThreadName=SipContainer-serversWorkerThread-5060-6;_RequestID=d13298fe-d6ae-483b-9e29-37caa9caf17c;|"Cant find matching transaction - Terminating"|#]

      WARNING|sun-glassfish-comms-server2.0|javax.enterprise.system.container.sip|_ThreadID=34;_ThreadName=Thread-39;_RequestID=d1046b7e-5ae4-48f6-a16f-938f2aa8a336;|Transaction was null: z9hG4bKd57df8d052efe87b4fd69d3553337d242ea5|#]
      ================================================================
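To quantify how many of each message a server.log contains, a small tally script can help. The following is a minimal sketch, not a tool used in this run: it assumes records are laid out as in the excerpt above (level, product, logger, `key=value;` fields, message, closing `|#]`), and the names `parse_records`, `count_messages`, and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Record layout assumed from the excerpt:
#   LEVEL|product|logger|key=value;...|message|#]
RECORD = re.compile(
    r"(?P<level>SEVERE|WARNING|INFO)\|"
    r"(?P<product>[^|]*)\|"
    r"(?P<logger>[^|]*)\|"
    r"(?P<fields>[^|]*)\|"
    r"(?P<message>.*?)\|#\]",
    re.DOTALL,
)


def parse_records(text):
    """Yield (level, thread_name, message) for each record found in text."""
    for m in RECORD.finditer(text):
        fields = dict(
            pair.split("=", 1)
            for pair in m.group("fields").split(";")
            if "=" in pair
        )
        # Collapse whitespace and drop surrounding quotes from the message.
        message = " ".join(m.group("message").split()).strip('"')
        yield m.group("level"), fields.get("_ThreadName", ""), message


def count_messages(text):
    """Count records per (level, normalized message)."""
    counts = Counter()
    for level, _thread, message in parse_records(text):
        # Replace the per-transaction branch id so similar warnings group together.
        key = re.sub(r"z9hG4bK\w+", "<branch>", message)
        counts[(level, key)] += 1
    return counts


# The two records quoted in the report, joined onto single lines.
SAMPLE = (
    "SEVERE|sun-glassfish-comms-server2.0|javax.enterprise.system.container.sip|"
    "_ThreadID=22;_ThreadName=SipContainer-serversWorkerThread-5060-6;"
    "_RequestID=d13298fe-d6ae-483b-9e29-37caa9caf17c;|"
    '"Cant find matching transaction - Terminating"|#]\n'
    "WARNING|sun-glassfish-comms-server2.0|javax.enterprise.system.container.sip|"
    "_ThreadID=34;_ThreadName=Thread-39;"
    "_RequestID=d1046b7e-5ae4-48f6-a16f-938f2aa8a336;|"
    "Transaction was null: z9hG4bKd57df8d052efe87b4fd69d3553337d242ea5|#]\n"
)

if __name__ == "__main__":
    for (level, message), n in count_messages(SAMPLE).most_common():
        print(n, level, message)
```

Grouping by level and normalized message makes it easier to compare runs, for example to check whether the number of terminated transactions tracks the load.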

      Finally, after several hours of such error messages in the server.log files,
      sipp communication stopped. After the instance was killed, the sipp screens
      showed around 300000 errors each.

      Please see the logs from this run at
      /net/asqe-logs/export1/SailFin/Results/sfbuild23/sbrf.

          Activity

          easarina added a comment -

          Added a keyword: system-test

          easarina added a comment -

          I re-ran this test on x86 machines. When an instance was killed, a huge
          number of error messages were again created:
          "Cant find matching transaction - Terminating"
          "Transaction was null"

          Then the heap became full and communication stopped.

          See logs from this run at:
          /net/asqe-logs/export1/SailFin/Results/sfbuild23/sbrf_run2

          Scott Oaks added a comment -

          The "Cant find matching transaction" error comes from a container bug; that
          needs to be addressed before we can investigate whether there are additional
          ill effects.

          easarina added a comment -

          Build 25. I executed this test on SuSE machines. Before an instance was
          killed, the run was OK. After an instance was killed, I saw error messages in
          the server.log files, including many "Cant find matching transaction -
          Terminating", and an OOM soon followed. See all logs under:

          http://agni-1.sfbay.sun.com/net/asqe-logs/export1/SailFin/Results/sfbuild25/sbr_ssr/

          Scott Oaks added a comment -

          Build 25 did not contain the SSR-OOM fixes targeted for build 27 (in
          particular, issues 1862 and 1888).

          Errors in the build 25 log prior to the first failure indicate that something
          else is likely wrong in the configuration – there were network issues before
          any failure was induced.

          Need to re-examine for build 27.

          easarina added a comment -

          In the server.log files of one server (inst1) I can see a few "Can not find
          matching transaction - Terminating" messages and really nothing else. As far
          as I can tell from the different tests, the number of terminated transactions
          depends on the load. I agree that the run has to be executed again with the
          new fixes. But could you clarify what was wrong in the configuration?

          Joe Fialli added a comment -

          reassign

          Joe Fialli added a comment -

          The patch from Tuesday appears to fix this in Steve DiMilla's
          subscribe-refresh run: no errors after running for 3-4 days.
          This is an umbrella issue, and this status applies to all
          child issues.

          Testing the latest version of the patch today to verify they all remain fixed.

          Joe Fialli added a comment -

          No longer seeing this issue in Steve DiMilla's subscribe-refresh run,
          which has been running for 4 days now.

          It was fixed by the check-in for issues 1607 and 1613 on 8/12.


            People

            • Assignee: Joe Fialli
            • Reporter: easarina
            • Votes: 0
            • Watchers: 0
