sailfin: SAILFIN-1794

Conference app: after killing and restarting an instance, errors flood server.log and the HTTP client sessions run out of memory (OOM).

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0
    • Fix Version/s: milestone 1
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: Linux
      Platform: All

      Description

      ******************************************************************************************************

      • Template v0.1 ( 05/01/08 )
      • Sailfin Stress test issue
        ******************************************************************************************************
        Sailfin Build : 16
        Cluster size : 10
        Happens in a single instance (y/n) ? : NA
        Test id : st5_1_conference_uac
        Location of the test : as-telco-sqe/stress-ws/conference
        JDK version : 64 bit - 1.6.0_12
        CLB used : Yes
        HW LB used : No.
        SSR: Enabled

      I started 9 SIPp clients with the following command, giving an effective load of
      about 100 cps per instance in the 10-instance cluster.

      SIPp command:
      sipp -t t1 -sf st5_1_conference_uac-failure.xml -r 111 -l 10000 -nd -trace_err
      -trace_screen -buff_size 33554432 -reconnect_close false -reconnect_sleep 1000
      -p <common-port-number-x> <sailfin_host:sip_port>

      I started 9 HTTP clients with the following command, giving an effective load of
      about 100 cps per instance in the 10-instance cluster.

      http client command:
      ant test1 -Dtest.args="<sailfin_host> <sailfin_http_port> 10
      <common-port-number-x> 111 10000"
      *****************************************************************************
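      The 100 cps figure follows from the SIPp rate flag: -r 111 sets each client's
      call rate (calls per second, with SIPp's default 1-second rate period), so nine
      clients offer about 999 calls per second in total. The minimal Python sketch below
      makes that arithmetic explicit, assuming the CLB spreads calls evenly over all
      instances; per_instance_cps is a hypothetical helper, not part of the test kit.

```python
# Hypothetical helper (not part of the Sailfin test kit). It assumes the CLB
# spreads the offered SIP load evenly across every running instance.
def per_instance_cps(clients: int, rate_per_client: float, instances: int) -> float:
    """Return the average calls per second handled by one instance."""
    return clients * rate_per_client / instances


# 9 SIPp clients, each started with -r 111, on the full 10-instance cluster.
print(per_instance_cps(9, 111, 10))  # 99.9

# Same offered load after one instance is killed (9 survivors), again
# assuming an even redistribution by the CLB.
print(per_instance_cps(9, 111, 9))  # 111.0
```

      Under the same even-distribution assumption, killing one instance raises the load
      on each survivor to about 111 cps. That is an estimate, not a value measured in
      this run.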

      Scenario Steps:
      1. Start all SIPp clients and HTTP clients (Follow the README at the test app
      location)
      2. After about 36 hours of running, kill an instance that is not associated
      with any client (in my case, instance4).
      3. After about 10 hours, restart the failed instance.
      4. Let the run continue.

      ***********

      I'm opening a new bug because I don't see any of the exceptions described in
      SAILFIN-1739. SuSE machines were used.

      1) During the first 36 hours of running (before an instance was killed), the run
      was perfect.
      2) When an instance was killed:
      a) First, in the server.log files of the other instances I saw what I believe are
      reasonable "Connection refused" messages, for example:
      ===================================================================
      [#|2009-05-31T07:24:19.436-0700|SEVERE|sun-glassfish-comms-server1.5|
      javax.enterprise.system.container.clb|_ThreadID=50;_ThreadName=http-proxy-outboundWorkerThread-8080-10
      ;_RequestID=0a42665e-6d7a-4de1-a820-7656f56365b8;|WorkerThreadImpl unexpected
      exception: java.lang.RuntimeException: java.net.ConnectException: Connection refused
      at org.jvnet.glassfish.comms.clb.proxy.outbound.DefaultCallBackHandler.onConnect
      (DefaultCallBackHandler.java:157)
      ================================================
      (see the full messages in log files)

      b) But then I saw error messages like this:
      ===================================================================
      [#|2009-05-31T07:44:46.884-0700|WARNING|sun-glassfish-comms-server1.5|
      Conference|_ThreadID=70;_ThreadName=httpSSLWorkerThread-11808-8
      ;_RequestID=050c838b-4786-4cea-ab5e-e09d70af8afb;|ERROR: processSecondPage
      CHS==null values from request: sasKey null|sessionID =
      70ee2f31c4139887b623f33e1b52|getContextPath /conference|getParameterMap
      {page=[Ljava.lang.String;@28f35ba}|getPathInfo null|getPathTranslated
      null|getQueryString null|getRequestURI
      /conference/ConferenceHttpServlet|getRequestURL
      http://bigapp-oblade-7.sfbay.sun.com:11808/conference/ConferenceHttpServlet|getServerName
      bigapp-oblade-7.sfbay.sun.com|getLocalPort 11808|getProtocol HTTP/1.1|getScheme
      http|parameters page=second||#]

      [#|2009-05-31T07:44:46.895-0700|SEVERE|sun-glassfish-comms-server1.5|Conference|_ThreadID=70
      ;_ThreadName=httpSSLWorkerThread-11808-8;_RequestID=050c838b-4786-4cea-ab5e-e09d70af8afb;|ERROR:
      Exception processing the following request, page = second , sasKey = null,
      sessionID = 70ee2f31c4139887b623f33e1b52
      java.io.IOException: processSecondPage: ConvergedHttpSession unexpectedly null
      ==================================================================

      (see the full messages in the log files)

      During the roughly 10 hours between killing and restarting the instance, these
      messages did not stop. More than 200 server.log files were created per instance,
      and all of them contained these messages.

      c) After the instance was killed, I started to see "Fail Sessions" in the HTTP
      logs: about 30 failed sessions were created per instance. I believe that is OK.

      3) The instance was restarted.

      a) After the restart, the flood of error messages in the server.log files did not
      stop, and I started to see not only "processSecondPage" errors but also:
      ========================================================================
      [#|2009-06-01T00:20:56.428-0700|WARNING|sun-glassfish-comms-server1.5|Conference|
      _ThreadID=69;_ThreadName=httpSSLWorkerThread-11808-2;_RequestID=b90ac6fd-898f-4e66-8117-519b686c2a77;|ERROR:
      processFinalPage CHS == null values from request: sasKey null|sessionID =
      aa518ecee260f89176f894470703|getContextPath /conference|getParameterMap
      {page=[Ljava.lang.String;@2b351dc}|getPathInfo null|getPathTranslated
      null|getQueryString null|getRequestURI
      /conference/ConferenceHttpServlet|getRequestURL
      http://bigapp-oblade-8.sfbay.sun.com:11808/conference/ConferenceHttpServlet|getServerName
      bigapp-oblade-8.sfbay.sun.com|getLocalPort 11808|getProtocol HTTP/1.1|getScheme
      http|parameters page=third||#]

      [#|2009-06-01T00:20:56.428-0700|SEVERE|sun-glassfish-comms-server1.5|Conference|
      _ThreadID=69;_ThreadName=httpSSLWorkerThread-11808-2;_RequestID=b90ac6fd-898f-4e66-8117-519b686c2a77;|ERROR:
      Exception processing the following request, page = third , sasKey = null,
      sessionID = aa518ecee260f89176f894470703
      java.io.IOException: processFinalPage: CHS unexpectedly null
      ===============================================================

      About 6500 server.log files containing "processFinalPage" and "processSecondPage"
      errors were created for the restarted instance.

      b) After the instance was restarted, the number of HTTP "FAIL Sessions" started to
      grow, reaching about 300 failed sessions per instance.

      c) About 10 hours after the instance was restarted, I saw OOM messages in the
      HTTP logs. The test stopped at this point.

      The machine from which I was running the HTTP sessions had 16 GB of RAM.

      I also want to add that after I killed all HTTP sessions, I found that two HTTP
      processes still existed and were taking about 5 GB of RAM.

      Bottom line:
      1) I believe the flood of error messages should not appear in the server.log
      files.
      2) I'm concerned that after the instance was restarted, the run degraded and
      the HTTP sessions finally failed with OOM.

      Please see all logs at:

      http://agni-1.sfbay.sun.com/net/asqe-logs/export1/SailFin/Results/build16/conf_ssr_kill
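      The log excerpts above use GlassFish's uniform log format,
      [#|timestamp|level|product|logger|name=value;...|message|#]. With hundreds of
      server.log files per instance (about 6500 for the restarted one), tallying the two
      error signatures by level may make runs easier to compare. The following is a
      minimal Python sketch, assuming every record follows that layout and no message
      contains the |#] terminator; parse_records and count_errors are hypothetical
      helpers, not part of the Sailfin test kit.

```python
import re
from collections import Counter

# One GlassFish log record: [#|timestamp|level|product|logger|name=value;...|message|#]
# DOTALL lets a record span several lines (wrapped messages, stack traces).
RECORD_RE = re.compile(r"\[#\|(.*?)\|#\]", re.DOTALL)


def parse_records(text):
    """Yield one dict per log record found in text."""
    for match in RECORD_RE.finditer(text):
        # The message may itself contain '|', so split off only the first five fields.
        parts = match.group(1).split("|", 5)
        if len(parts) < 6:
            continue  # malformed or truncated record; skip it
        timestamp, level, product, logger, pairs, message = parts
        # Name/value pairs such as _ThreadID=70;_ThreadName=...;
        fields = dict(p.strip().split("=", 1) for p in pairs.split(";") if "=" in p)
        yield {
            "timestamp": timestamp.strip(),
            "level": level.strip(),
            "logger": logger.strip(),
            "fields": fields,
            "message": message.strip(),
        }


def count_errors(text, markers=("processSecondPage", "processFinalPage")):
    """Count records per (level, marker) whose message mentions the marker."""
    counts = Counter()
    for record in parse_records(text):
        for marker in markers:
            if marker in record["message"]:
                counts[(record["level"], marker)] += 1
    return counts
```

      Feeding the concatenated contents of one instance's server.log files to
      count_errors returns a Counter keyed by (level, marker), for example
      ("SEVERE", "processFinalPage").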

        Activity

        easarina added a comment -

        The conference app was used.

        easarina added a comment -

        Added the keyword system-test.

        Scott Oaks added a comment -

        Sony and I have discussed the need for a unified HTTP/SIP driver, since part of the
        issue here is that the requests get out of sync. However, given the state of changes
        in SSR since build 16, this needs to be re-run and re-evaluated.

        shreedhar_ganapathy added a comment -

        ..

        shreedhar_ganapathy added a comment -

        Fixes have gone in for Issue #1926.
        Marking this issue as fixed. Please reopen if a rerun of this scenario with b30 reproduces this problem.

            *** This issue has been marked as a duplicate of 1926 ***

          People

          • Assignee:
            Mahesh Kannan
            Reporter:
            easarina
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: