sailfin
  1. sailfin
  2. SAILFIN-2016

High traffic loss (SessionExpired messages) observed in the converged app system test

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0
    • Fix Version/s: milestone 1
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      Build 31d
      Parent Issue - 1983

      Session loss for the converged app scenario is partly indicated by the number of
      "SessionExpired" messages in the server logs. On loss of SIP or HTTP requests,
      the calls do not complete which in turn leads to sessions not invalidating
      correctly and that causes the session to expire. The "SessionExpired" message is
      being printed from the applications call back listener.

      ~2300 "SessionExpired" messages were observed in the cluster. This number is
      high with respect to the total number of active sessions, which is ~30K
      (call-rate * call-length * no-of-clients = 37 * 90s * 9 = 29970)

      Some of the items to look into that may help with debugging the issue further:
      1. SessionExpired messages can be co-related to 3 kinds of errors on the SIPp
      clients: 503 responses, 481 response and timeouts.
      2. The SIPp client error logs show that some of the the BYE requests timeout
      upto a minute after the first one that times-out. We expect to only see a few
      BYEs timing out 30 seconds after the failure (since the SIPp scenario has a 30
      second timeout on the receive of 200 OK for the BYE). Only a few should timeout
      as a timeout would be expected if the BYE request reached the failed instance
      just before it failed or was lost in the pipeline...

      Note that the SessionExpired message appear on all instance (other than the
      failed/restarted one) for about 10 minutes after failure detection. This is
      because of the session-timeout value being set to 10 min for the app.

      Some Numbers:

      • "SessionExpired" messages in the cluster = 2286
      • SIPp client errors = 1078 (no-of-481-responses - 142, no-of-503-responses -
        479, no-of-timeouts - 437)
      • HTTP client failed sessions = 854
      • Total reported client errors = 1078 + 854 = 1932

      Now, some remove-replica messages are also expected to be lost. For this run the
      FLUSH_INTERVAL_FOR_REMOVAL_MILLIS was set to 1000 (i.e. 1 second). If we
      consider 2 second worth of remove-replica's being lost, there should be a
      maximum of ((37 SIP + 37 HTTP) * 2) = 148 SessionExpired messages caused by this.

        Issue Links

          Activity

          Hide
          prasads added a comment -

          Temporarily marking these issues as P4, for the Sailfin 2.0 release. The
          priority will be restored for the next release.

          Show
          prasads added a comment - Temporarily marking these issues as P4, for the Sailfin 2.0 release. The priority will be restored for the next release.
          Hide
          Mahesh Kannan added a comment -

          Looks like SASTimerMigrationTasks are not unregistered upon bulk removal.
          Currently preparing a patch and will update the bug after I test it...

          Show
          Mahesh Kannan added a comment - Looks like SASTimerMigrationTasks are not unregistered upon bulk removal. Currently preparing a patch and will update the bug after I test it...
          Hide
          varunrupela added a comment -

          Closed the issue by mistake ..... while updating it with info on the last run..

          Here is update:
          The last debug run with a default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting
          produce the following numbers:

          • SessionExpired messages - 4322
          • "unexpectedly null" - 62
          • "IllegalState" - 0
          • 503 Responses - 519
          • 481 Responses - 62
          • timeouts - 384
          Show
          varunrupela added a comment - Closed the issue by mistake ..... while updating it with info on the last run.. Here is update: The last debug run with a default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting produce the following numbers: SessionExpired messages - 4322 "unexpectedly null" - 62 "IllegalState" - 0 503 Responses - 519 481 Responses - 62 timeouts - 384
          Hide
          varunrupela added a comment -

          The last debug run with a default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting
          produce the following numbers:

          • SessionExpired messages - 4
              • This issue has been marked as a duplicate of 322 ***
          Show
          varunrupela added a comment - The last debug run with a default FLUSH_INTERVAL_FOR_REMOVAL_MILLIS setting produce the following numbers: SessionExpired messages - 4 This issue has been marked as a duplicate of 322 ***
          Hide
          varunrupela added a comment -
          • corrected the synopsis.

          The SessionExpired messages really indicate http or sip message loss and not
          session loss. You are right Mahesh, if the session were lost then we would have
          seen some more null messages in the logs.

          Apart from the loss of remove replica messages, the SessionExpired messages
          (i.e. traffic loss) are also related to the 503 responses and the timeouts on
          BYE/200-OK.

          The 481 responses though indicate session loss and so should be related to the
          "unexpectedly null" messages in the logs.

          Show
          varunrupela added a comment - corrected the synopsis. The SessionExpired messages really indicate http or sip message loss and not session loss. You are right Mahesh, if the session were lost then we would have seen some more null messages in the logs. Apart from the loss of remove replica messages, the SessionExpired messages (i.e. traffic loss) are also related to the 503 responses and the timeouts on BYE/200-OK. The 481 responses though indicate session loss and so should be related to the "unexpectedly null" messages in the logs.

            People

            • Assignee:
              Mahesh Kannan
              Reporter:
              varunrupela
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated: