SAILFIN-1927

[blocking] Disable/Enable of an instance causing uneven traffic distribution

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: b30
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      ******************************************************************************************************
      * Template v0.1 ( 05/01/08 )
      * Sailfin Stress test issue
      ******************************************************************************************************
      Sailfin Build : 28
      Cluster size : 9
      Happens in a single instance (y/n)? : NA
      Test id : st2_4_presence_subscribe-refresh-failure
      Location of the test : as-telco-sqe/stress-ws/presence
      JDK version : 1.6.0_16
      CLB used : Yes
      HW LB used : Yes
      SSR : Enabled

      A 9-instance cluster was used in this test. 4 instances (instance103,
      instance104, instance105 and instance109) were used as both front end and
      back end (FE/BE); the other 5 were back end only. 4 SIPp instances were
      started with the following command to get an effective call rate of
      140 cps and a load of 150k sessions per instance:

      sipp -t t1 -sf st2_4_presence_subscribe-refresh-failure.xml -r 315 -l 338000 -d
      1073000 -nd -trace_err -trace_screen -trace_logs -buff_size 33554433
      -reconnect_close false -max_reconnect 10 -reconnect_sleep 3000
      <instance-host>:35060
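The per-instance figures quoted above are consistent with those SIPp flags; a quick arithmetic check (assuming the load from the 4 SIPp instances spreads evenly across the 9-instance cluster):

```python
# Arithmetic behind the quoted per-instance load figures.
# Assumes the load balancers spread the 4 SIPp streams evenly over 9 instances.
sipp_instances = 4
rate_per_sipp = 315        # -r 315 (calls per second per SIPp)
limit_per_sipp = 338_000   # -l 338000 (concurrent call limit per SIPp)
cluster_size = 9

cps_per_instance = sipp_instances * rate_per_sipp / cluster_size
sessions_per_instance = sipp_instances * limit_per_sipp / cluster_size

print(cps_per_instance)             # 140.0 cps, matching the stated rate
print(round(sessions_per_instance)) # ~150k sessions per instance
```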

      Rolling Upgrade steps were performed on instance101.
      http://wiki.glassfish.java.net/attach/SFv2FunctionalSpecs/rolling_upgrade_one_pager_ver2.html

      Logs for the run are available at:
      sf-x2200-11:/space/sony/logs/2.0/b28/subscribe-refresh-RU/IEC1-SuSE/

      Issue:
      On running disable-converged-lb-server instance101 (done at 11:42:51):
      a. All other instances saw a drop in traffic (see the presence-stats.txt
      file under the instance logs).
      b. instance101 still showed up as a healthy instance in the server logs of
      the other instances.

      On running enable-converged-lb-server instance101 (done at 11:54:28),
      traffic distribution was quite uneven for an extended period of time.
      Traffic to some instances restarted only after the reconcile step was
      completed.
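The disable/enable steps above correspond to the converged load balancer admin commands run against instance101; a minimal sketch of the sequence (illustrative, reconstructed from the report rather than copied from the run):

```shell
# Rolling-upgrade traffic drain/restore for one instance, as described above.
# Illustrative sketch; invocation details reconstructed from the report.

# 11:42:51 - take instance101 out of the CLB active list (traffic drains away)
asadmin disable-converged-lb-server instance101

# ... perform the upgrade steps on instance101 ...

# 11:54:28 - put instance101 back into the active list (traffic should return)
asadmin enable-converged-lb-server instance101
```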

      This caused SIPp to back up calls and send them later in larger bursts,
      causing "Can't find matching transaction" errors in the logs and some
      JXTA errors.

      See file
      sf-x2200-11:/space/sony/logs/2.0/b28/subscribe-refresh-RU/IEC1-SuSE/rolling-upgrade-sift-logs/config.RollingUpgrade_testRollingUpgrade/sift/controller.log
      for the exact times at which each Rolling Upgrade step was completed (by
      searching for the exact admin command that was used).

      1. presence-stats_instance10_24x1.txt
        554 kB
        Bhavanishankar
      2. presence-stats.txt
        304 kB
        Bhavanishankar
      3. sipp_screen_log_24x1.log
        7 kB
        Bhavanishankar
      4. subscribe_refresh.log.txt
        17 kB
        Bhavanishankar

        Activity

        varunrupela created issue -
        varunrupela added a comment -

        This issue blocks Rolling Upgrade testing.
        kshitiz_saxena added a comment -

        Instance health is based on the GMS view, so a disabled instance can
        still be healthy; instance101 will therefore still show up in the CLB's
        healthy view.

        However, it will not be part of the active list and will not be used
        for any request processing.
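The distinction between the GMS-driven health view and the routing-time active list can be sketched as follows (all names here are illustrative, not the actual CLB code):

```python
# Minimal sketch of the behaviour described above: health is driven by GMS
# membership, while request routing uses a separate "active" list.
# Class and method names are hypothetical, not the real CLB implementation.

class ConvergedLoadBalancer:
    def __init__(self, gms_members):
        self.gms_members = set(gms_members)  # GMS view: drives "healthy"
        self.active = set(gms_members)       # active list: drives routing

    def disable(self, instance):
        # disable-converged-lb-server removes the instance from the active
        # list only; GMS still sees it as alive, so it stays "healthy".
        self.active.discard(instance)

    def healthy(self, instance):
        return instance in self.gms_members

    def routable(self, instance):
        return instance in self.active

clb = ConvergedLoadBalancer(["instance101", "instance102"])
clb.disable("instance101")
print(clb.healthy("instance101"))   # True  - still in the GMS view
print(clb.routable("instance101"))  # False - no longer receives requests
```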
        Bhavanishankar added a comment -

        This looks like an SSR issue; Kshitiz and I are looking into it.
        Bhavanishankar added a comment -

        This issue is reported with the following configuration:

        instance1, instance2, instance3, instance4, instance5, instance6, .....,
        instance14

        where instances 1-4 are used as CLB front ends only, and instances 5-14
        as CLB back ends only.

        This is not a typical supported configuration.

        So, the RU testing can be done with a valid configuration where all the
        instances are used as both front ends and back ends.

        However, I am fixing the expat-related issue in SSR for your
        configuration. The fix is at http://fisheye5.atlassian.com/cru/SFIN-127.
        With this fix, things should work fine even for your configuration.

        But, as I mentioned before, please continue testing with a valid
        configuration to see if there are any other issues.
        Bhavanishankar added a comment -

        Fixed the expat calculation issue with the cluster configuration having

        instance1, instance2, instance3, instance4, instance5, instance6, ....., instance14

        where instances 1-4 are used as CLB front ends only, and instances 5-14
        as CLB back ends only.

        With this fix, I verified that the Rolling Upgrade works fine with:

        (a) subscribe refresh @ 233 cps with all the instances as CLB front
        end + back end, rolling the instances to which SIPp was not connected.

        (b) subscribe refresh @ 233 cps with some instances acting purely as
        CLB front ends, rolling all other instances.

        In both cases: response time < 10 milliseconds, no spikes in traffic,
        no "Can't find matching transaction" errors.

        The check-in details are at

        https://glassfish.dev.java.net/servlets/ReadMsg?list=cvs&msgNo=31115
        Bhavanishankar added a comment -

        The fix will be available in v2 b30.
        Bhavanishankar added a comment -

        Created an attachment (id=1089)
        attaching the screen snapshot of the rolling upgrade 12-hour run
        Bhavanishankar added a comment -

        Created an attachment (id=1090)
        attaching the presence-stats file of the rolling upgrade 12-hour run
        varunrupela added a comment -

        The uneven distribution seems to continue to be a problem on one of the
        setups (8-core). Bhavani is looking into the root cause.
        Bhavanishankar added a comment -

        With my previous fix, I had made sure that the RU worked well on a
        4-core setup, hence I had marked this as fixed.

        But later I realized that there was a thread-blocking issue, seen more
        frequently on 8-core setups (very rarely on 4-core setups), which was
        causing the uneven traffic, the "Can't find matching transaction"
        errors, the call backups, etc.

        I have filed and fixed the threading issue as part of issue 1943.
        Please refer to it for the complete details.

        With the 1943 fix, I verified that the RU works fine on both 4-core and
        8-core setups. Hence, marking the issue as fixed.
        Bhavanishankar added a comment -

        Created an attachment (id=1097)
        one of the SIPp screen logs of the 24x1 RU verification run (there were
        4 SIPp instances in total).
        Bhavanishankar added a comment -

        Created an attachment (id=1098)
        24x1 traffic distribution on one of the rolled instances in an 8-core
        10-instance cluster (6 instances were rolled).
        kenaiadmin made changes -
        Field: issue.field.bugzillaimportkey | Original Value: 1927 | New Value: 19179

          People

          • Assignee:
            Bhavanishankar
          • Reporter:
            varunrupela
          • Votes:
            0
          • Watchers:
            0
