sailfin / SAILFIN-1436

OOM after about 3 hours, x86, iptv-reinvite, SSR enabled.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: milestone 1
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: Solaris X86
      Platform: All

      Description

      *****************************************************************************************************
      Template v0.1 (05/01/08)
      Sailfin Stress test issue
      *****************************************************************************************************
      Sailfin Build: 60
      Cluster size: 8
      Happens in a single instance (y/n)?:
      Test id: st1_3_iptv_reinvite
      Location of the test: as-telco-sqe/stress-ws/iptv
      JDK version: 1.6.0_07
      CLB used: Yes
      HW LB used: No
      *******************************************
      SSR enabled

      ==========================================

      x86 machines, one instance per machine, one SIPp per instance.

      After I started the test, it ran fine for about an hour. After that I started
      to see unexpected messages on the SIPp screens near second 100 and near the
      pause; the unexpected messages near the pause produced errors in the SIPp
      error logs: "Continuing call on unexpected BYE for call..."

      At that time I did not see any errors in the server.log files.

      But then I started to see OOM messages in the server.log files:
      "OutOfMemoryError: Java heap space"

      I've reproduced this issue three times. The problem happened after about 3 hours
      of running.

      I've attached one gc.log.

        Activity

        Mahesh Kannan added a comment -

        We ran with a patch that printed the call stack to figure out why replication is
        stopped during a run. The last run did run into OOM. The first sign of trouble
        occurred at 05:20, when replication was stopped on n1c1m6 and n1c1m7. The top
        and gc stats are attached below.

        Also, netstat -s (on x86) showed zero UDP packet loss.

        We are investigating why we stop replicating in the middle of a run.

        We have also reasoned out why this could cause memory growth. In this case,
        when n1c1m6 stops replicating, n1c1m7 continues to flood n1c1m6 with load
        advisory calls for SAS expiration. We also found that the background task
        that sends the load advisory requests never stops sending them until the
        primary saves the SAS (which won't happen in this case, as n1c1m6, the
        primary, has stopped replicating).
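
        To make the reasoning above concrete, here is a minimal, hypothetical sketch
        of a bounded resend task. The class, interface, retry cap, and interval are
        illustrative assumptions, not the Sailfin code or the eventual fix; the point
        is only that the task described above has no exit condition once the primary
        stops replicating, while a bounded version would stop flooding it.

        // Illustrative sketch only: names and numbers are assumptions, not Sailfin APIs.
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicInteger;

        public class LoadAdvisoryResender {

            interface SasState { boolean primarySavedSas(); }      // hypothetical check

            private static final int MAX_ATTEMPTS = 10;            // assumed cap, not from the real fix
            private final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            private final AtomicInteger attempts = new AtomicInteger();

            public void scheduleResend(final Runnable sendAdvisory, final SasState sas) {
                scheduler.scheduleWithFixedDelay(new Runnable() {
                    public void run() {
                        if (sas.primarySavedSas()) {
                            scheduler.shutdown();                  // normal termination
                        } else if (attempts.incrementAndGet() > MAX_ATTEMPTS) {
                            scheduler.shutdown();                  // give up instead of flooding a
                                                                   // primary that stopped replicating
                        } else {
                            sendAdvisory.run();                    // resend the load advisory
                        }
                    }
                }, 0, 5, TimeUnit.SECONDS);
            }
        }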

        n1c1m5
        ------
        Dec-04 05:20 Heap Size: 2.2 GB
        Dec 04 06:09 CMS failure
        Dec 04 04:32 Top: 3.2GB (Process size)
        Dec 04 05:08 Top: 3.2GB (Process size)
        Dec 04 05:44 Top: 3.2GB (Process size)

        n1c1m6
        ------

        Dec-04 05:20 OutOfPipes Replication stopped
        Dec-04 05:20 Heap Size: 2.23 GB
        Dec 04 06:10 CMS failure
        Dec 04 04:32 Top: 3.29GB (Process size)
        Dec 04 05:08 Top: 3.29GB (Process size)
        Dec 04 05:44 Top: 3.29GB (Process size)

        n1c1m7
        ------

        Dec-04 05:20 OutOfPipes Replication stopped
        Dec-04 05:20 Heap Size: 2.2 GB
        Dec 04 05:32 CMS failure
        Dec 04 04:32 Top: 3.4GB (Process size)
        Dec 04 05:08 Top: 3.4GB (Process size)
        Dec 04 05:44 Top: 3.4GB (Process size)

        n1c1m10
        ------

        Dec-04 05:20 Heap Size: 2.2 GB
        Dec 04 05:34 CMS failure
        Dec 04 04:32 Top: 3.4GB (Process size)
        Dec 04 05:08 Top: 3.4GB (Process size)
        Dec 04 05:44 Top: 3.4GB (Process size)

        hamada added a comment -

        add to cc-list

        Mahesh Kannan added a comment -

        We ran with the patch where JxtaReplicationSender logged the exceptions at INFO
        level (as opposed to FINE).
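
        For context, the change described here amounts to raising the log level on
        the send-failure path so it appears in server.log at the default level. A
        minimal sketch, assuming a java.util.logging logger; the method and message
        text are assumptions, not the actual patch:

        import java.util.logging.Level;
        import java.util.logging.Logger;

        public class ReplicationSendLogging {
            // Logger name matches the class seen in the stack traces below; the
            // method and message are illustrative, not the actual patch.
            private static final Logger LOG =
                    Logger.getLogger("com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender");

            static void logSendFailure(Exception ex) {
                // was: LOG.log(Level.FINE, "Replication send failed", ex);
                LOG.log(Level.INFO, "Replication send failed", ex);
            }
        }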

        During our previous run, we found that at 17:04 replication was turned off
        because the pipes were closed (through cleanup()). At the same time, the
        replica machine's server.log shows "SocketChannel is not connected.
        closing connection".

        Here are the snippets from the two logs:

        On the replica machine:
        -----------------------

        [#|2008-12-08T17:04:32.017-0800|WARNING|sun-glassfish-comms-server1.5|net.jxta.impl.endpoint.tcp.TcpMessenger|_ThreadID=38;_ThreadName=Executor - 42;_RequestID=75084d1c-1f8a-4c3c-b51c-0a195395efd2;|SocketChannel is not connected. closing connection|#]

        [#|2008-12-08T17:04:32.524-0800|WARNING|sun-glassfish-comms-server1.5|net.jxta.impl.endpoint.tcp.TcpMessenger|_ThreadID=39;_ThreadName=Executor - 47;_RequestID=4a3de87e-d495-43bf-81d9-ed5ff0e11b9c;|SocketChannel is not connected. closing connection|#]

        On the sender machine:
        ----------------------

        [#|2008-12-08T17:04:04.280-0800|WARNING|sun-glassfish-comms-server1.5|net.jxta.impl.endpoint.tcp.TcpMessenger|_ThreadID=39;_ThreadName=ReplicationSenderExecutor - 50;_RequestID=dfd07d89-e63e-4f5a-9c87-5c89dfeaf2b3;|Message send failed for net.jxta.endpoint.Message@25664022(14) {126148}
        java.net.SocketTimeoutException: Failed sending net.jxta.endpoint.Message@25664022(14){126148}
        to : 10.5.216.30:9701
            at net.jxta.impl.endpoint.tcp.TcpMessenger.xmitMessage(TcpMessenger.java:651)
            at net.jxta.impl.endpoint.tcp.TcpMessenger.sendMessageDirect(TcpMessenger.java:555)
            at net.jxta.util.JxtaBiDiPipe.sendMessage(JxtaBiDiPipe.java:1029)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender.sendOverPipe(JxtaReplicationSender.java:709)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender.sendState(JxtaReplicationSender.java:1954)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender.sendState(JxtaReplicationSender.java:1934)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender.sendBulkMessage(JxtaReplicationSender.java:2004)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender.createMessageAndSend(JxtaReplicationSender.java:1926)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender.access$1300(JxtaReplicationSender.java:97)
            at com.sun.enterprise.ee.web.sessmgmt.JxtaReplicationSender$BulkMessageSender.run(JxtaReplicationSender.java:2211)
            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
            at java.lang.Thread.run(Thread.java:619)
        Caused by: java.net.SocketTimeoutException: Timeout during socket write
            at net.jxta.impl.endpoint.tcp.TcpMessenger.write(TcpMessenger.java:740)
            at net.jxta.impl.endpoint.tcp.TcpMessenger.write(TcpMessenger.java:679)
            at net.jxta.impl.endpoint.tcp.TcpMessenger.xmitMessage(TcpMessenger.java:633)
            ... 12 more

        I am trying to reduce the time taken to reproduce this issue by forcibly
        shutting off replication after 2 hours. Also, Larry has produced a patch
        whereby, upon disabling replication, the sender will try to "reconnect".

        I will update the bug later this evening with the outcome of these two runs.
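
        A minimal sketch of the reconnect-on-failure idea mentioned above, assuming
        a hypothetical pipe wrapper with send/reconnect operations; the names and
        the retry budget are illustrative, not Larry's actual patch to
        JxtaReplicationSender/JxtaBiDiPipeWrapper:

        import java.io.IOException;

        public class ReconnectingSender {

            interface Pipe {                              // stand-in for the JXTA bidi pipe wrapper
                void send(byte[] msg) throws IOException;
                void reconnect() throws IOException;
            }

            private static final int MAX_RECONNECTS = 3;  // assumed retry budget

            static boolean sendWithReconnect(Pipe pipe, byte[] msg) {
                for (int attempt = 0; attempt <= MAX_RECONNECTS; attempt++) {
                    try {
                        pipe.send(msg);
                        return true;                      // sent successfully
                    } catch (IOException e) {
                        if (attempt == MAX_RECONNECTS) {
                            return false;                 // let the health checker disable replication
                        }
                        try {
                            pipe.reconnect();             // re-open the pipe instead of turning
                                                          // replication off on the first timeout
                        } catch (IOException re) {
                            // fall through and retry; the loop bound prevents spinning forever
                        }
                    }
                }
                return false;                             // not reached in practice; required by the compiler
            }
        }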

        lwhite added a comment -

        With these fixes, the x86 cluster ran smoothly for around 24 hours.

        Fix checked into tip: SJSAS91_FCS_BRANCH

        Checking in FailureNotificationEventHandler.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/FailureNotificationEventHandler.java,v
        <-- FailureNotificationEventHandler.java
        new revision: 1.5.6.6; previous revision: 1.5.6.5
        done
        Checking in JxtaBiDiPipeWrapper.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/JxtaBiDiPipeWrapper.java,v
        <-- JxtaBiDiPipeWrapper.java
        new revision: 1.17.2.22; previous revision: 1.17.2.21
        done
        Checking in JxtaReplicationSender.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/JxtaReplicationSender.java,v
        <-- JxtaReplicationSender.java
        new revision: 1.18.2.27; previous revision: 1.18.2.26
        done
        Checking in PipeWrapper.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/PipeWrapper.java,v
        <-- PipeWrapper.java
        new revision: 1.8.2.14; previous revision: 1.8.2.13
        done
        Checking in ReplicationHealthChecker.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/ReplicationHealthChecker.java,v
        <-- ReplicationHealthChecker.java
        new revision: 1.9.2.23; previous revision: 1.9.2.22
        done

        Fix checked into FCS branch: SGES21_FCS_BRANCH

        Checking in FailureNotificationEventHandler.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/FailureNotificationEventHandler.java,v
        <-- FailureNotificationEventHandler.java
        new revision: 1.5.6.5.4.1; previous revision: 1.5.6.5
        done
        Checking in JxtaBiDiPipeWrapper.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/JxtaBiDiPipeWrapper.java,v
        <-- JxtaBiDiPipeWrapper.java
        new revision: 1.17.2.18.2.4; previous revision: 1.17.2.18.2.3
        done
        Checking in JxtaReplicationSender.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/JxtaReplicationSender.java,v
        <-- JxtaReplicationSender.java
        new revision: 1.18.2.24.2.3; previous revision: 1.18.2.24.2.2
        done
        Checking in PipeWrapper.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/PipeWrapper.java,v
        <-- PipeWrapper.java
        new revision: 1.8.2.13.2.1; previous revision: 1.8.2.13
        done
        Checking in ReplicationHealthChecker.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/ReplicationHealthChecker.java,v
        <-- ReplicationHealthChecker.java
        new revision: 1.9.2.18.2.4; previous revision: 1.9.2.18.2.3
        done

        easarina added a comment -

        Verified with a 24x1 run on x86, build 60d nightly 12/16.


          People

          • Assignee: Mahesh Kannan
          • Reporter: easarina
          • Votes: 0
          • Watchers: 2

            Dates

            • Created:
            • Updated:
            • Resolved: