Parent Issue - 1983
Session loss for the converged app scenario is partly indicated by the number of
"SessionExpired" messages in the server logs. When SIP or HTTP requests are
lost, the calls do not complete, so the sessions are not invalidated correctly
and eventually expire. The "SessionExpired" message is printed from the
application's callback listener.
~2300 "SessionExpired" messages were observed in the cluster. This number is
high relative to the total number of active sessions, which is ~30K
(call-rate * call-length * no-of-clients = 37 * 90s * 9 = 29970).
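The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, the values are the ones quoted in this issue, and the unit of the call rate (calls per second per client) is an assumption.

```python
# Values quoted in this issue; variable names are illustrative only.
call_rate = 37        # calls per second per client (unit assumed)
call_length_s = 90    # call length in seconds
num_clients = 9

# Steady-state active sessions = call-rate * call-length * no-of-clients
active_sessions = call_rate * call_length_s * num_clients

observed_expired = 2300  # approximate "SessionExpired" count in the cluster
expired_ratio = observed_expired / active_sessions

print(active_sessions)         # 29970
print(f"{expired_ratio:.1%}")  # 7.7%
```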
Some items to look into that may help debug the issue further:
1. SessionExpired messages can be correlated with 3 kinds of errors on the SIPp
clients: 503 responses, 481 responses and timeouts.
2. The SIPp client error logs show that some of the BYE requests time out
up to a minute after the first one that times out. We expect to see only a few
BYEs timing out, 30 seconds after the failure (since the SIPp scenario has a 30
second timeout on receiving the 200 OK for the BYE). Only a few should time out,
because a timeout is expected only if the BYE request reached the failed
instance just before it failed or was lost in the pipeline...
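One way to check item 2 against the SIPp error logs is to flag BYE timeouts that occur well after the first one. The sketch below is a hypothetical helper, not part of SIPp or the test harness: the function name, the 5 second tolerance, and the sample timestamps are all assumptions made for illustration, and extracting the timestamps from the logs is left out.

```python
def late_timeouts(timeout_times_s, tolerance_s=5.0):
    """Return timeouts that occur more than tolerance_s after the first one.

    timeout_times_s: BYE timeout timestamps in seconds (any common origin).
    tolerance_s: assumed width of the expected cluster of timeouts.
    """
    if not timeout_times_s:
        return []
    first = min(timeout_times_s)
    return [t for t in sorted(timeout_times_s) if t - first > tolerance_s]


# Hypothetical timestamps (seconds after the failure), not taken from the run.
sample = [30.2, 30.9, 31.5, 58.0, 89.7]
print(late_timeouts(sample))  # [58.0, 89.7]
```

Timeouts returned by the helper are the ones that do not fit the "BYE sent just before the failure" explanation and would be worth inspecting individually.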
Note that the SessionExpired messages appear on all instances (other than the
failed/restarted one) for about 10 minutes after failure detection. This is
because the session-timeout value is set to 10 min for the app.
- "SessionExpired" messages in the cluster = 2286
- SIPp client errors = 1078 (no-of-481-responses = 142, no-of-503-responses =
479, no-of-timeouts = 437)
- HTTP client failed sessions = 854
- Total reported client errors = 1078 + 854 = 1932
Some remove-replica messages are also expected to be lost. For this run
FLUSH_INTERVAL_FOR_REMOVAL_MILLIS was set to 1000 (i.e. 1 second). If we
consider 2 seconds' worth of remove-replica messages being lost, there should
be a maximum of ((37 SIP + 37 HTTP) * 2) = 148 SessionExpired messages caused
by this.
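Putting the figures above together gives a rough accounting of the SessionExpired messages. The sketch below assumes that each reported client error and each lost remove-replica message accounts for at most one SessionExpired message, which is a simplification. It uses the reported totals as given; the listed per-type SIPp breakdown (142 + 479 + 437) sums to 1058 rather than 1078, so the remainder is approximate. Variable names are illustrative.

```python
# Figures reported in this issue.
session_expired = 2286   # "SessionExpired" messages in the cluster
sipp_errors = 1078       # reported SIPp total (listed breakdown sums to 1058)
http_failed = 854        # HTTP client failed sessions

client_errors = sipp_errors + http_failed  # 1932

# Upper bound on SessionExpired caused by lost remove-replica messages:
# 2 seconds' worth at 37 SIP + 37 HTTP per second.
sip_rate, http_rate = 37, 37
lost_window_s = 2
max_lost_remove_replica = (sip_rate + http_rate) * lost_window_s  # 148

# Assumption: one SessionExpired per client error or lost remove-replica.
explained_upper_bound = client_errors + max_lost_remove_replica  # 2080
unexplained = session_expired - explained_upper_bound            # 206

print(client_errors, max_lost_remove_replica, unexplained)
```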