Issue Details

Key: SAILFIN-1927
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: Bhavanishankar
Reporter: varunrupela
Votes: 0
Watchers: 0
Project: sailfin

[blocking] Disable/Enable of an instance causing uneven traffic distribution

Created: 18/Aug/09 02:45 AM   Updated: 31/Aug/09 01:26 AM   Resolved: 31/Aug/09 01:26 AM
Component/s: session_replication
Affects Version/s: 2.0
Fix Version/s: b30

Time Tracking:
Not Specified

File Attachments: 1. Text File presence-stats.txt (304 kB) 25/Aug/09 01:10 AM - Bhavanishankar
2. Text File presence-stats_instance10_24x1.txt (554 kB) 31/Aug/09 01:26 AM - Bhavanishankar
3. Text File sipp_screen_log_24x1.log (7 kB) 31/Aug/09 01:23 AM - Bhavanishankar
4. Text File subscribe_refresh.log.txt (17 kB) 25/Aug/09 01:09 AM - Bhavanishankar

Environment:

Operating System: All
Platform: All


Issuezilla Id: 1927
Tags: system-test
Participants: Bhavanishankar, kshitiz_saxena and varunrupela


Description

******************************************************************************************************
* Template v0.1 ( 05/01/08 )
* Sailfin Stress test issue
******************************************************************************************************
Sailfin Build : 28
Cluster size : 9
Happens in a single instance (y/n) ? : NA
Test id : st2_4_presence_subscribe-refresh-failure
Location of the test : as-telco-sqe/stress-ws/presence
JDK version : 1.6.0_16
CLB used : Yes
HW LB used : Yes
SSR: Enabled

A 9-instance cluster was used in this test. Four instances (instance103,
instance104, instance105 and instance109) were used as both FE and BE; the
other five were used as BE only. Four SIPp instances were started with the
following command to get an effective call rate of 140 cps and to load 150k
sessions per instance:

sipp -t t1 -sf st2_4_presence_subscribe-refresh-failure.xml -r 315 -l 338000 -d
1073000 -nd -trace_err -trace_screen -trace_logs -buff_size 33554433
-reconnect_close false -max_reconnect 10 -reconnect_sleep 3000
<instance-host>:35060
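
For reference, the per-instance figures quoted above follow from the SIPp
options; a back-of-the-envelope check (assuming the CLB spreads the load evenly
across the 9 instances):

4 SIPp clients x 315 cps   (-r 315)    = 1260 cps       ~ 140 cps per instance (1260 / 9)
4 SIPp clients x 338000    (-l 338000) = 1,352,000 calls ~ 150k concurrent sessions per instance (1,352,000 / 9)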

Rolling Upgrade steps were performed on instance101.
http://wiki.glassfish.java.net/attach/SFv2FunctionalSpecs/rolling_upgrade_one_pager_ver2.html

Logs for the run are available at:
sf-x2200-11:/space/sony/logs/2.0/b28/subscribe-refresh-RU/IEC1-SuSE/

Issue:
On running disable-converged-lb-server for instance101 (done at 11:42:51):
a. All other instances saw a drop in traffic (see the presence-stats.txt file
under the instance logs).
b. instance101 still shows up as a healthy instance in the server logs of the
other instances.

On running enable-converged-lb-server for instance101 (done at 11:54:28),
traffic distribution was quite uneven for an extended period of time. For some
instances, traffic restarted only after the reconcile step was completed.
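
For clarity, these disable/enable steps are issued through the asadmin CLI. The
exact invocations used in this run are recorded in the controller.log
referenced below; they were of this general form (admin host/port options
omitted here):

asadmin disable-converged-lb-server instance101     (issued at 11:42:51)
asadmin enable-converged-lb-server instance101      (issued at 11:54:28)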

This caused SIPp to back up calls and send them later in larger spurts, causing
"Cant find matching transaction" errors in the logs and some JXTA errors.

See file
sf-x2200-11:/space/sony/logs/2.0/b28/subscribe-refresh-RU/IEC1-SuSE/rolling-upgrade-sift-logs/config.RollingUpgrade_testRollingUpgrade/sift/controller.log
for the exact times at which each Rolling Upgrade step was completed (by
searching for the exact admin command that was used).
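
A minimal way to pull those timestamps out of controller.log (assuming the
admin commands appear verbatim in the log, as they did in this run):

grep -n "converged-lb-server instance101" controller.log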



varunrupela added a comment - 18/Aug/09 02:53 AM

This issue blocks Rolling Upgrade testing.


kshitiz_saxena added a comment - 19/Aug/09 01:08 AM

The health of an instance is based on the GMS view, so a disabled instance can
still be healthy. instance101 will therefore still show up in the CLB healthy view.

However, it will not be part of the active list and will not be used for any
request processing.


Bhavanishankar added a comment - 21/Aug/09 10:45 AM

This looks like an SSR issue; Kshitiz and I are looking into it.


Bhavanishankar added a comment - 21/Aug/09 11:04 AM

This issue is reported with the following configuration:

instance1, instance2, instance3, instance4, instance5, instance6, .....,
instance14

where instances1-4 were used as CLB front-end only, and instances5-6 were used
as CLB back-end only.

This is not a typical supported configuration.

So, the RU testing can be done with a valid configuration where all the
instances are used as both front-ends and back-ends.

However, I am fixing the expat-related issue in SSR for your configuration. The
fix is at http://fisheye5.atlassian.com/cru/SFIN-127. With this fix, things
should work fine even for your configuration.

But as I mentioned before, please continue testing with a valid configuration to
see if there are any other issues.


Bhavanishankar added a comment - 24/Aug/09 11:37 AM

Fixed the expat calculation issue with the cluster configuration having

instance1, instance2, instance3, instance4, instance5, instance6, ....., instance14

where instances1-4 were used as CLB front-end only, and instances5-6 were used
as CLB back-end only.

With this fix, I verified that the Rolling Upgrade works fine with:

(a) subscribe refresh @ 233 cps with all the instances as CLB
frontend+backend, rolling the instances to which SIPp was not connected.

(b) subscribe refresh @ 233 cps with some instances acting purely as CLB
front-ends, rolling all the other instances.

In both cases: response time < 10 milliseconds, no spikes in traffic, and no
"can't find matching" errors.

The check-in details are at

https://glassfish.dev.java.net/servlets/ReadMsg?list=cvs&msgNo=31115


Bhavanishankar added a comment - 24/Aug/09 11:38 AM

The fix will be available in v2 b30.


Bhavanishankar added a comment - 25/Aug/09 01:09 AM

Created an attachment (id=1089)
attaching the screen snapshot of the rolling upgrade 12-hour run


Bhavanishankar added a comment - 25/Aug/09 01:10 AM

Created an attachment (id=1090)
attaching the presence-stats file of the rolling upgrade 12-hour run


varunrupela added a comment - 27/Aug/09 09:18 AM

The uneven distribution continues to be a problem on one of the setups
(8-core). Bhavani is looking into the root cause.


Bhavanishankar added a comment - 31/Aug/09 01:14 AM

With my previous fix, I had made sure that the RU worked well on a 4-core
setup, hence I had marked this as fixed.

But later I realized that there was a thread-blocking issue, seen more
frequently on 8-core setups (and very rarely on 4-core setups), which was
causing the uneven traffic, the "can't find matching" errors, the call
backups, etc.

I have filed and fixed the threading issue as part of issue 1943. Please refer
to it for the complete details.

With the 1943 fix, I verified that the RU works fine on both 4-core and 8-core
setups. Hence, marking the issue as fixed.


Bhavanishankar added a comment - 31/Aug/09 01:23 AM

Created an attachment (id=1097)
one of the SIPp screen logs of the 24x1 RU verification run (there were 4 SIPp instances in total).


Bhavanishankar added a comment - 31/Aug/09 01:26 AM

Created an attachment (id=1098)
24x1 traffic distribution in one of the rolled instances in the 8-core, 10-instance cluster (6 instances were rolled).