SAILFIN-1441

Hardware Failure System Test: Many missing Notifies after injecting hardware failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: milestone 1
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Issuezilla Id:
      1441

      Description

      ******************************************************************************************************
      • Template v0.1 (05/01/08)
      • Sailfin Stress test issue
      ******************************************************************************************************
        Sailfin Build : 60a
        Cluster size : 10
        Happens in a single instance (y/n) ? : NA
        Test id : st2_2_presence_subscribe-60-failure
        Location of the test : as-telco-sqe/stress-ws/presence
        JDK version : 1.6.0_07
        CLB used : Yes
        HW LB used : Yes.
        SSR: Enabled

      Deviations from README :

      Used the following sipp command to start 9 SIPp instances (if not using
      Barracuda, each targeting a different instance).

      sipp -t t1 -sf st2_2_presence_subscribe-60-failure.xml -r 111 -l 10000 -nd
      -trace_screen -trace_err -trace_logs -buff_size 33554432
      <sailfin-instance-host:port> -reconnect_close false -max_reconnect 10
      -reconnect_sleep 3000
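The launch step above can be scripted. Below is a minimal sketch of starting one SIPp client per SailFin instance with the command quoted in the report; the host names and port are placeholders (the report elides the actual `<sailfin-instance-host:port>` values), and the script only prints the commands rather than executing them:

```shell
#!/bin/sh
# Hypothetical sketch: build one sipp invocation per SailFin instance.
# Host names/ports below are placeholders, not taken from the report.
SCENARIO=st2_2_presence_subscribe-60-failure.xml

build_sipp_cmd() {
    # $1 = <sailfin-instance-host:port> to target
    echo "sipp -t t1 -sf $SCENARIO -r 111 -l 10000 -nd" \
         "-trace_screen -trace_err -trace_logs -buff_size 33554432" \
         "$1 -reconnect_close false -max_reconnect 10 -reconnect_sleep 3000"
}

COUNT=0
for i in 1 2 3 4 5 6 7 8 9; do
    HOST="instance$i.example.com:5060"   # placeholder host:port
    CMD=$(build_sipp_cmd "$HOST")
    echo "$CMD"                          # print only; pipe to sh to actually run
    COUNT=$((COUNT + 1))
done
```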

      • Let the test run for about 8 minutes.
      • Power-off the machine containing the one instance with which no SIPp client
        was associated.
      • Now, let the test run for another 20 minutes.
      • Power the machine back on, apply the necessary system tunings, and restart
        the node-agent and the instance.
      • Let the test run for about 10 more minutes.
      • Pause all SIPp traffic, i.e., press the "p" key on all the SIPp screens
        (10 minutes after the restart of the failed instance). Wait a minute after
        that and observe the SIPp screens.
      • SIPp screens were saved 5 times using kill -SIGUSR2 <sipp-process-id>:
        after a few minutes of failure; after ~15 minutes of failure (to log the
        timed-out notifies); twice after restarting the instance; and finally after
        pausing traffic on all the SIPp clients (to log the total missed notifies).

      Please note that the Notify timeout was set to 15 minutes in the scenario file.
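The screen-save step above can be scripted as well. A minimal sketch mirroring the report's kill -SIGUSR2 step, which makes SIPp dump its screens to its trace files; the PID is a placeholder, and the helper only prints the command:

```shell
#!/bin/sh
# Hypothetical helper mirroring the report's screen-save step:
# sending SIGUSR2 makes a running sipp process dump its screens.
save_screens() {
    # $1 = PID of a running sipp process (placeholder in this sketch)
    echo "kill -SIGUSR2 $1"   # print only; pipe to sh to send the signal
}

CMD=$(save_screens 12345)     # 12345 is a placeholder PID
echo "$CMD"
```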

      Observations:

      • A total of ~600 Notify timed-out seen together on all SIPp screens about 15
        minutes after the power off.
      • A total of ~6700 Notify timed-out seen after the entire test was complete.

      This issue is not the same as issue 1391. In the case of issue 1391, about
      150K Notifies were missing; explanations and fixes for those are available
      in issue 1392.

      1. server.log
        69 kB
        Bhavanishankar
      2. st2_2_presence_subscribe-60-failure_2661_screen.log
        7 kB
        Bhavanishankar

        Issue Links

          Activity

          Bhavanishankar added a comment -

          On Dec 5th, when I ran Subscribe 60 with a single failure, I saw only
          ~40 lost notifies. Based on that, I had updated the issue.

          But with b60c, surprisingly, I saw ~500 lost notifies instead of ~40.

          So I wanted to re-validate my evaluation and confirm that there was a
          regression in b60c (after the Dec 5th build). I re-ran the subscribe 60
          scenario with the Dec 5th build
          (sailfin-image-v1-b60b-nightly-05_dec_2008.jar), and attached (just
          above this comment) are the logs, which confirm my evaluation. Also
          notice that the test ran continuously, without pause, for an hour after
          the failure with only 46 lost notifies.

          So, something regressed in b60c (after Dec 5th). Since I have not made
          any checkins during that period, I suspect that some other checkin
          between the 5th and the 10th caused this regression. I would appreciate
          it if Jan could look into 1441 and get it back to where it was on the
          5th, then assign it back to me if there are further issues to deal with.

          I hope this clears up the confusion around 1441.

          varunrupela added a comment -

          Making this issue a sub-issue of the newly opened umbrella issue 1499 to
          keep track of the scenario. Re-prioritizing this issue as P2 (rather
          than opening a separate one for 85 cps) since this issue has all the
          context.

          The last run was done at 85 cps and this issue of missing notifies was
          observed: ~8000 Notifies were found missing at the end of the scenario
          (which included power-off and power-on of one of the machines). The
          last run used nightly build 60c from 10th Dec.

          varunrupela added a comment -
          • Updated the Summary to track the failure through this issue. Separate
            issue 1504 has been opened so that Dev and QE can more easily track
            the restart of the instance.
          • Also moving dependencies to the umbrella issue 1499.
          Bhavanishankar added a comment -

          With the latest build, default jxta settings, and the load governor
          OFF, the subscribe_60 failure scenario works fine. Hence marking this
          issue as fixed.

          Bhavanishankar added a comment -

          The default JXTA settings are:

          -DjxtaTcpMaxPoolSize=100
          -DjxtaTcpCorePoolSize=5
          -DjxtaTcpBlockingQueueSize=10
          -DjxtaMulticastPoolsize=300
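For reference, JVM options like these are normally applied with asadmin create-jvm-options. A minimal sketch that builds those commands from the values listed above (print only; the exact domain/target flags are elided and would depend on the deployment):

```shell
#!/bin/sh
# Hypothetical sketch: emit asadmin commands for the default JXTA JVM
# options quoted above. Print only; pipe to sh to actually apply them.
OPTS="-DjxtaTcpMaxPoolSize=100 -DjxtaTcpCorePoolSize=5 -DjxtaTcpBlockingQueueSize=10 -DjxtaMulticastPoolsize=300"

N=0
for opt in $OPTS; do
    echo "asadmin create-jvm-options '$opt'"
    N=$((N + 1))
done
```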

          To switch off the load governor, add the following property under
          availability-service:

          <property name="replication_load_governor_enabled" value="false"/>
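In domain.xml, that property would sit under the availability-service element roughly as follows (a sketch; the element's attributes and other children are elided):

```xml
<availability-service>
  <!-- other availability/replication settings elided -->
  <property name="replication_load_governor_enabled" value="false"/>
</availability-service>
```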


            People

            • Assignee:
              Bhavanishankar
              Reporter:
              varunrupela
    • Votes:
      0
      Watchers:
      0

              Dates

              • Created:
                Updated:
                Resolved: