[SAILFIN-2016] High traffic loss (SessionExpired messages) observed in the converged app system test Created: 08/Oct/09  Updated: 25/Nov/10

Status: Open
Project: sailfin
Component/s: session_replication
Affects Version/s: 2.0
Fix Version/s: milestone 1

Type: Bug Priority: Minor
Reporter: varunrupela Assignee: Mahesh Kannan
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive http-client-logs.zip     Zip Archive instance-logs.zip     Zip Archive sipp-client-logs.zip     Zip Archive stats-and-domain-logs.zip    
Issue Links:
Dependency
blocks SAILFIN-1983 [umbrella] 24x1/7 Converged App ST wi... Resolved
Issuezilla Id: 2,016
Tags: system-test

 Description   

Build 31d
Parent Issue - 1983

Session loss for the converged app scenario is partly indicated by the number of
"SessionExpired" messages in the server logs. When SIP or HTTP requests are
lost, calls do not complete, which in turn means sessions are not invalidated
correctly, and those sessions eventually expire. The "SessionExpired" message is
printed from the application's callback listener.
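For reference, a minimal sketch of such a listener is shown below. The class and
logger names are hypothetical (the test application's source is not attached);
it only illustrates where a "SessionExpired" log line would come from in a JSR
289 application.

    import java.util.logging.Logger;
    import javax.servlet.sip.SipApplicationSessionEvent;
    import javax.servlet.sip.SipApplicationSessionListener;
    import javax.servlet.sip.annotation.SipListener;

    // Hypothetical listener: logs the line counted in the server logs whenever a
    // converged session expires instead of being invalidated at call teardown.
    @SipListener
    public class ConvergedAppSessionListener implements SipApplicationSessionListener {

        private static final Logger log = Logger.getLogger("ConvergedApp");

        public void sessionCreated(SipApplicationSessionEvent ev) { }

        public void sessionDestroyed(SipApplicationSessionEvent ev) { }

        public void sessionReadyToInvalidate(SipApplicationSessionEvent ev) {
            // Normal path: the call completed, so the session can be invalidated.
            ev.getApplicationSession().invalidate();
        }

        public void sessionExpired(SipApplicationSessionEvent ev) {
            // Abnormal path: the session was never invalidated and its timer ran out.
            log.warning("SessionExpired: " + ev.getApplicationSession().getId());
        }
    }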

~2300 "SessionExpired" messages were observed in the cluster. This number is
high with respect to the total number of active sessions, which is ~30K
(call-rate * call-length * no-of-clients = 37 * 90s * 9 = 29970)

Some items to look into that may help with debugging the issue further:
1. SessionExpired messages can be correlated with three kinds of errors on the
SIPp clients: 503 responses, 481 responses and timeouts.
2. The SIPp client error logs show that some of the BYE requests time out up to
a minute after the first one that times out. We expect to see only a few BYEs
timing out 30 seconds after the failure (the SIPp scenario has a 30-second
timeout on the receipt of the 200 OK for the BYE); a timeout should occur only
when a BYE request reached the failed instance just before it failed or was
lost in the pipeline...

Note that the SessionExpired messages appear on all instances (other than the
failed/restarted one) for about 10 minutes after failure detection. This is
because the session-timeout value for the app is set to 10 minutes.

Some Numbers:

  • "SessionExpired" messages in the cluster = 2286
  • SIPp client errors = 1078 (481 responses - 142, 503 responses - 479,
    timeouts - 437)
  • HTTP client failed sessions = 854
  • Total reported client errors = 1078 + 854 = 1932

Now, some remove-replica messages are also expected to be lost. For this run,
FLUSH_INTERVAL_FOR_REMOVAL_MILLIS was set to 1000 (i.e. 1 second). If we assume
2 seconds' worth of remove-replicas are lost, there should be at most
((37 SIP + 37 HTTP) * 2) = 148 SessionExpired messages caused by this.
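As a back-of-the-envelope check, the estimates quoted above can be reproduced
with the numbers from this run (illustrative only; the rates and the 2-second
loss window are taken from the description above, nothing else is assumed):

    // Rough reproduction of the estimates quoted in the description.
    public class ReplicaLossEstimate {
        public static void main(String[] args) {
            int callRate = 37;       // call-rate used in the estimates above
            int callLengthSec = 90;  // call length in seconds
            int clients = 9;         // number of clients
            int lossWindowSec = 2;   // assumed window of lost remove-replica batches

            int activeSessions = callRate * callLengthSec * clients;               // 29970 (~30K)
            int maxExpiredFromLostRemoves = (callRate + callRate) * lossWindowSec; // 148 (SIP + HTTP)

            System.out.println("active sessions ~ " + activeSessions);
            System.out.println("max SessionExpired from lost removes = " + maxExpiredFromLostRemoves);
        }
    }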



 Comments   
Comment by varunrupela [ 08/Oct/09 ]

Created an attachment (id=1135)
instance logs

Comment by varunrupela [ 08/Oct/09 ]

Created an attachment (id=1136)
sipp client logs

Comment by varunrupela [ 08/Oct/09 ]

Created an attachment (id=1137)
http client logs

Comment by varunrupela [ 08/Oct/09 ]

Created an attachment (id=1138)
stats and domain logs

Comment by varunrupela [ 08/Oct/09 ]

updated dependency and added keyword

Comment by Mahesh Kannan [ 08/Oct/09 ]

Here is a preliminary analysis:

1. First of all, setting FLUSH_INTERVAL_FOR_REMOVAL_MILLIS to 2 seconds doesn't
necessarily cause the removal thread to flush the batched remove commands every
2 seconds (particularly under load).

2. Secondly, we use a single timer to manage the timer tasks that send these
remove commands. In other words, removes for SAS, SS and HTTP are all managed by
independent TimerTasks, but they are all run by a single timer thread. So it is
possible that multiple batched remove commands are lost when the instance is
killed (a sketch of this arrangement follows at the end of this comment).

3. Varun: can you explain what you mean by the following:
<quote>
Session loss for the converged app scenario is partly indicated by the number of
"SessionExpired" messages in the server logs. On loss of SIP or HTTP requests,
the calls do not complete which in turn leads to sessions not invalidating
correctly and that causes the session to expire
</quote>

If SIP and HTTP sessions were really lost, then would we have seen other types
of exceptions (like null SIP / HTTP sessions)? I feel that these are purely due
to the missed SIP, HTTP and SAS remove commands.
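Following up on point 2: below is a minimal, self-contained sketch of that
batching arrangement. It is not SailFin's actual replication code (class and
method names are hypothetical); it only shows how independent TimerTasks, one
per session type, share a single java.util.Timer thread, so an instance kill
can take out several un-flushed batches at once.

    import java.util.Queue;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative sketch of batched remove-replica commands flushed by timer tasks.
    public class RemoveBatcher {

        // One shared timer thread runs every flush task (for SAS, SS and HTTP alike).
        private static final Timer SHARED_TIMER = new Timer("replica-remove-flusher", true);

        private final String sessionType; // "SAS", "SipSession" or "HTTP"
        private final Queue<String> pendingRemoves = new ConcurrentLinkedQueue<String>();

        public RemoveBatcher(String sessionType, long flushIntervalMillis) {
            this.sessionType = sessionType;
            // Each session type gets its own TimerTask, but all tasks share the one
            // timer thread above; batches queued here and not yet flushed are lost
            // if the instance is killed.
            SHARED_TIMER.schedule(new TimerTask() {
                public void run() {
                    flush();
                }
            }, flushIntervalMillis, flushIntervalMillis);
        }

        public void queueRemove(String sessionId) {
            pendingRemoves.add(sessionId);
        }

        private void flush() {
            String id;
            while ((id = pendingRemoves.poll()) != null) {
                // The real container would send a remove-replica command to the
                // replica partner here; this sketch just prints it.
                System.out.println(sessionType + " remove-replica: " + id);
            }
        }
    }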

Comment by varunrupela [ 09/Oct/09 ]
  • corrected the synopsis.

The SessionExpired messages really indicate HTTP or SIP message loss and not
session loss. You are right, Mahesh: if the sessions were lost, we would have
seen more null-session messages in the logs.

Apart from the loss of remove-replica messages, the SessionExpired messages
(i.e. traffic loss) are also related to the 503 responses and the timeouts on
BYE/200-OK.

The 481 responses, though, do indicate session loss and so should be related to
the "unexpectedly null" messages in the logs.

Comment by varunrupela [ 09/Oct/09 ]

The last debug run with a default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting
produced the following numbers:

  • SessionExpired messages - 4

*** This issue has been marked as a duplicate of 322 ***
Comment by varunrupela [ 09/Oct/09 ]

Closed the issue by mistake while updating it with info on the last run.

Here is the update:
The last debug run with a default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting
produced the following numbers:

  • SessionExpired messages - 4322
  • "unexpectedly null" - 62
  • "IllegalState" - 0
  • 503 Responses - 519
  • 481 Responses - 62
  • timeouts - 384
Comment by Mahesh Kannan [ 11/Oct/09 ]

It looks like SASTimerMigrationTasks are not unregistered upon bulk removal.
I am currently preparing a patch and will update the bug after I test it...
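The patch itself is not attached to this issue. Purely to illustrate the idea
(all names below are hypothetical; this is not the SailFin fix): a bulk-removal
path would also need to unregister and cancel the per-session timer task, so
that the task does not fire later and report the session as expired.

    import java.util.Map;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch only: cancel registered timer tasks during bulk removal.
    public class BulkRemovalSketch {

        private final Map<String, TimerTask> migrationTasks =
                new ConcurrentHashMap<String, TimerTask>();

        public void register(String sessionId, TimerTask task) {
            migrationTasks.put(sessionId, task);
        }

        public void removeSessionsInBulk(Iterable<String> sessionIds) {
            for (String id : sessionIds) {
                // ... remove the local/replica session state for 'id' here ...
                TimerTask task = migrationTasks.remove(id); // unregister the task too
                if (task != null) {
                    task.cancel(); // otherwise it fires later and the session "expires"
                }
            }
        }
    }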

Comment by prasads [ 13/Oct/09 ]

Temporarily marking these issues as P4 for the Sailfin 2.0 release. The
priority will be restored for the next release.
