Issue Details

Key: SAILFIN-2016
Type: Bug
Status: Open
Priority: Minor
Assignee: Mahesh Kannan
Reporter: varunrupela
Votes: 0
Watchers: 0

High traffic loss (SessionExpired messages) observed in the converged app system test

Created: 08/Oct/09 11:31 AM   Updated: 25/Nov/10 06:40 PM
Component/s: session_replication
Affects Version/s: 2.0
Fix Version/s: milestone 1

Time Tracking:
Not Specified

File Attachments:
1. http-client-logs.zip (Zip Archive, 7.68 MB) - 08/Oct/09 11:56 AM - varunrupela
2. instance-logs.zip (Zip Archive, 365 kB) - 08/Oct/09 11:53 AM - varunrupela
3. sipp-client-logs.zip (Zip Archive, 81 kB) - 08/Oct/09 11:54 AM - varunrupela
4. stats-and-domain-logs.zip (Zip Archive, 1.28 MB) - 08/Oct/09 11:57 AM - varunrupela

Environment:

Operating System: All
Platform: All

Issue Links:
Dependency
 

Issuezilla Id: 2016
Tags: system-test
Participants: Mahesh Kannan, prasads and varunrupela


Description

Build 31d
Parent Issue - 1983

Session loss for the converged app scenario is partly indicated by the number of
"SessionExpired" messages in the server logs. On loss of SIP or HTTP requests,
the calls do not complete, which in turn leads to sessions not being invalidated
correctly, and that causes the sessions to expire. The "SessionExpired" message is
printed from the application's callback listener.

~2300 "SessionExpired" messages were observed across the cluster. This number is
high relative to the total number of active sessions, which is ~30K
(call-rate * call-length * no-of-clients = 37 * 90s * 9 = 29970).
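
For reference, here is a minimal sketch of the kind of JSR 289 callback listener
that would log these "SessionExpired" messages. The actual application code is
not attached to this issue, so the class name and logging details below are
assumptions, not the test application itself.

import java.util.logging.Logger;

import javax.servlet.sip.SipApplicationSessionEvent;
import javax.servlet.sip.SipApplicationSessionListener;

// Sketch only: shows where a "SessionExpired" log line would come from.
public class ConvergedAppSessionListener implements SipApplicationSessionListener {

    private static final Logger LOG =
            Logger.getLogger(ConvergedAppSessionListener.class.getName());

    public void sessionCreated(SipApplicationSessionEvent ev) {
        // Not relevant to this sketch.
    }

    public void sessionDestroyed(SipApplicationSessionEvent ev) {
        // Not relevant to this sketch.
    }

    public void sessionReadyToInvalidate(SipApplicationSessionEvent ev) {
        // Normal path: the call completed, so the SipApplicationSession can be
        // invalidated (the container does this by default when ready).
    }

    public void sessionExpired(SipApplicationSessionEvent ev) {
        // Abnormal path: a lost SIP/HTTP message prevented clean invalidation,
        // so the session hits its 10-minute session-timeout and expires here.
        LOG.warning("SessionExpired: " + ev.getApplicationSession().getId());
    }
}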

Some items to look into that may help with debugging the issue further:
1. SessionExpired messages can be correlated to 3 kinds of errors on the SIPp
clients: 503 responses, 481 responses and timeouts.
2. The SIPp client error logs show that some of the BYE requests time out up
to a minute after the first one that times out. We expect to see only a few
BYEs timing out 30 seconds after the failure (since the SIPp scenario has a 30
second timeout on the receipt of the 200 OK for the BYE). Only a few should time
out, as a timeout would be expected only if the BYE request reached the failed
instance just before it failed, or was lost in the pipeline.

Note that the SessionExpired messages appear on all instances (other than the
failed/restarted one) for about 10 minutes after failure detection. This is
because the session-timeout value is set to 10 minutes for the app.

Some Numbers:

  • "SessionExpired" messages in the cluster = 2286
  • SIPp client errors = 1078 (481 responses - 142, 503 responses - 479,
    timeouts - 437)
  • HTTP client failed sessions = 854
  • Total reported client errors = 1078 + 854 = 1932

Now, some remove-replica messages are also expected to be lost. For this run,
FLUSH_INTERVAL_FOR_REMOVAL_MILLIS was set to 1000 (i.e. 1 second). If we
consider two seconds' worth of remove-replica messages being lost, that accounts
for a maximum of ((37 SIP + 37 HTTP) * 2) = 148 SessionExpired messages.
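
To make the bound explicit, the estimate above is just the following arithmetic;
it uses no numbers beyond the ones already quoted in this description.

// Worked bound for SessionExpired messages attributable to lost remove-replica
// commands, using only the figures given in the description above.
public class LostRemoveEstimate {
    public static void main(String[] args) {
        int sipRemovesPerSecond  = 37; // SIP call rate
        int httpRemovesPerSecond = 37; // HTTP session rate
        int secondsOfLostFlushes = 2;  // assumed window of lost batched removes

        int maxSessionExpired =
                (sipRemovesPerSecond + httpRemovesPerSecond) * secondsOfLostFlushes;
        System.out.println("Max SessionExpired from lost removes = "
                + maxSessionExpired); // prints 148
    }
}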



varunrupela added a comment - 08/Oct/09 11:53 AM

Created an attachment (id=1135)
instance logs


varunrupela added a comment - 08/Oct/09 11:54 AM

Created an attachment (id=1136)
sipp client logs


varunrupela added a comment - 08/Oct/09 11:56 AM

Created an attachment (id=1137)
http client logs


varunrupela added a comment - 08/Oct/09 11:57 AM

Created an attachment (id=1138)
stats and domain logs


varunrupela added a comment - 08/Oct/09 12:00 PM

updated dependency and added keyword


Mahesh Kannan added a comment - 08/Oct/09 08:03 PM

Here is a preliminary analysis:

1. First of all, setting FLUSH_INTERVAL_FOR_REMOVAL_MILLIS to 2 seconds doesn't
necessarily cause the removal thread to flush the batched remove commands every
2 seconds (particularly under load).

2. Secondly, we use a single timer to manage the timer tasks that send these
remove commands. In other words, the removes for SAS, SS and HTTP are managed by
independent TimerTasks, but they are all run by a single timer thread. So it is
possible that multiple batched remove commands are lost when the instance is
killed (a sketch of this pattern follows at the end of this comment).

3. Varun: can you explain what you mean by the following:
<quote>
Session loss for the converged app scenario is partly indicated by the number of
"SessionExpired" messages in the server logs. On loss of SIP or HTTP requests,
the calls do not complete which in turn leads to sessions not invalidating
correctly and that causes the session to expire
</quote>

If SIP and HTTP sessions were really lost, wouldn't we have seen other types
of exceptions (like null SIP / HTTP sessions)? I feel that these are purely due
to the missed SIP, HTTP and SAS remove commands.
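
For illustration, the single-timer batching pattern described in points 1 and 2
could look roughly like the sketch below. All class, field and method names are
hypothetical and are not the actual SailFin replication-module code; the point
is only that one timer thread serves all three removal batches, so an instance
kill can drop several batches at once.

import java.util.Queue;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentLinkedQueue;

public class BatchedRemovalSketch {

    static final long FLUSH_INTERVAL_FOR_REMOVAL_MILLIS = 1000L;

    // Remove-replica commands are batched in memory until the next flush.
    static final Queue<String> sasRemoves  = new ConcurrentLinkedQueue<String>();
    static final Queue<String> ssRemoves   = new ConcurrentLinkedQueue<String>();
    static final Queue<String> httpRemoves = new ConcurrentLinkedQueue<String>();

    public static void main(String[] args) {
        // One Timer => one background thread shared by all three removal tasks.
        Timer removalTimer = new Timer("replica-removal", true);
        removalTimer.schedule(flushTask(sasRemoves),  0, FLUSH_INTERVAL_FOR_REMOVAL_MILLIS);
        removalTimer.schedule(flushTask(ssRemoves),   0, FLUSH_INTERVAL_FOR_REMOVAL_MILLIS);
        removalTimer.schedule(flushTask(httpRemoves), 0, FLUSH_INTERVAL_FOR_REMOVAL_MILLIS);
        // If the JVM is killed at this point, everything still sitting in the
        // three queues (up to one interval per task, more if a flush runs late
        // under load) is lost, and the corresponding replicas later expire.
    }

    static TimerTask flushTask(final Queue<String> batch) {
        return new TimerTask() {
            public void run() {
                // A slow flush in one task delays the other tasks as well,
                // because they all share the single timer thread.
                String sessionId;
                while ((sessionId = batch.poll()) != null) {
                    sendRemoveReplica(sessionId);
                }
            }
        };
    }

    static void sendRemoveReplica(String sessionId) {
        // Placeholder for the actual network send of the remove-replica command.
    }
}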


varunrupela added a comment - 09/Oct/09 03:57 AM
  • corrected the synopsis.

The SessionExpired messages really indicate HTTP or SIP message loss and not
session loss. You are right, Mahesh: if the sessions were lost, then we would have
seen some more null messages in the logs.

Apart from the loss of remove-replica messages, the SessionExpired messages
(i.e. traffic loss) are also related to the 503 responses and the timeouts on
BYE/200-OK.

The 481 responses, though, indicate session loss and so should be related to the
"unexpectedly null" messages in the logs.


varunrupela added a comment - 09/Oct/09 04:12 AM

The last debug run with the default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting
produced the following numbers:

  • SessionExpired messages - 4

*** This issue has been marked as a duplicate of 322 ***

varunrupela added a comment - 09/Oct/09 04:15 AM

Closed the issue by mistake while updating it with info on the last run.

Here is the update:
The last debug run with the default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting
produced the following numbers:

  • SessionExpired messages - 4322
  • "unexpectedly null" - 62
  • "IllegalState" - 0
  • 503 Responses - 519
  • 481 Responses - 62
  • timeouts - 384

Mahesh Kannan added a comment - 11/Oct/09 01:39 PM

Looks like SASTimerMigrationTasks are not unregistered upon bulk removal.
Currently preparing a patch and will update the bug after I test it...
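
For reference, a minimal sketch of the kind of change described above, assuming
a registry of migration tasks keyed by SAS id. All names here are hypothetical,
not the actual SailFin classes: on a bulk remove, each affected SAS's TimerTask
is looked up and cancelled so it cannot fire later and expire a session that was
already removed.

import java.util.Map;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

public class SasTimerTaskRegistrySketch {

    // SAS id -> the migration TimerTask scheduled for it.
    private final Map<String, TimerTask> migrationTasks =
            new ConcurrentHashMap<String, TimerTask>();

    void register(String sasId, TimerTask task) {
        migrationTasks.put(sasId, task);
    }

    // Called once per id carried in a batched remove-replica command.
    void unregisterOnBulkRemove(Iterable<String> sasIds) {
        for (String sasId : sasIds) {
            TimerTask task = migrationTasks.remove(sasId);
            if (task != null) {
                // Without this cancel, the task fires later and the SAS
                // expires instead of being cleanly removed.
                task.cancel();
            }
        }
    }
}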


prasads added a comment - 13/Oct/09 03:34 AM

Temporarily marking these issues as P4 for the Sailfin 2.0 release. The
priority will be restored for the next release.