SAILFIN-1862

Removal of stale replicas during expiry processing causes lots of OOM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: milestone 1
    • Component/s: session_replication
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Issuezilla Id:
      1,862

      Description

      In the in-memory store implementation, a replica can become stale for various
      reasons, for example:

      (a) A remove replica message is lost.
      (b) The replica partner is changed, and a load-ack message was lost by the
      previous replica partner.

      In such circumstances, the stale replicas need to be removed. This is
      currently done by a background thread (the so-called reaper thread).

      After the recent changes to the reaper thread for the SPI refactoring, it looks
      like store.remove() is invoked when a replica is found to be stale. This is
      incorrect: store.remove() sends a remove broadcast message, so it also removes
      the replicas held by the current replica partner. The intention was to remove
      only the stale replica copies from the instance's own replica cache, not from
      anywhere else.
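
      The difference between the two removal paths can be sketched as follows. This
      is a minimal illustration, not the SailFin code: ReplicaCacheSketch, its
      fields, and its method names are hypothetical, and the broadcasts list merely
      stands in for the cluster transport.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of one instance's replica cache; not the SailFin classes. */
class ReplicaCacheSketch {
    /** Stand-in for the cluster transport: records every broadcast that would be sent. */
    final List<String> broadcasts = new ArrayList<>();

    /** Replica id mapped to its expiry time in milliseconds. */
    final Map<String, Long> replicaCache = new ConcurrentHashMap<>();

    void addReplica(String id, long expiresAtMillis) {
        replicaCache.put(id, expiresAtMillis);
    }

    /** Models the store.remove() behaviour described above: local removal plus a remove broadcast. */
    void removeEverywhere(String id) {
        replicaCache.remove(id);
        broadcasts.add("remove:" + id);
    }

    /** Models the intended reaper behaviour: drop expired copies from the local cache only. */
    int reapStaleLocally(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = replicaCache.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= nowMillis) {
                it.remove(); // no message leaves this instance
                removed++;
            }
        }
        return removed;
    }
}
```

      In this sketch the reaper would call reapStaleLocally(), which never touches
      the transport, so copies held by the current replica partner are unaffected.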

        Activity

        Scott Oaks added a comment -

        I have fixed the processing of the stale replicas substantially while working on
        this bug.

        First, I have removed all broadcast load acknowledgments. Processing these
        substantially slowed down JXTA in large clusters because of the number of
        messages each instance had to handle: X hours after an instance restarts,
        where X is the session expiration value, all of the stale replicas expire at
        once and create a broadcast storm. These broadcast storms prevented other
        work from happening, so JXTA backed up and the replication states caused OOM
        errors.

        Acknowledgments are now sent via unicast. That gets each one to the correct
        receiver (correcting the bug reported here) and also prevents many of the OOM
        issues we have previously seen.
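
        As a rough illustration of why the switch matters, the sketch below compares
        how many message deliveries the cluster handles in each mode. It assumes a
        broadcast is delivered to every other instance and a unicast to exactly one
        receiver. The class and method names are hypothetical, and the figures in
        the usage note are made up for the example.

```java
/** Hypothetical back-of-the-envelope model of acknowledgment fan-out. */
class AckFanoutSketch {
    /** Broadcast: each ack is delivered to every instance except the sender. */
    static long broadcastDeliveries(int instances, long acks) {
        return acks * (instances - 1L);
    }

    /** Unicast: each ack is delivered to exactly one receiver. */
    static long unicastDeliveries(long acks) {
        return acks;
    }
}
```

        Under these assumptions, 1000 acknowledgments in a 10-instance cluster cost
        9000 deliveries when broadcast and 1000 when unicast.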

        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/dialogmgmt/DialogFragmentStoreImpl.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/dialogmgmt/DialogFragmentStoreImpl.java,v
        <-- DialogFragmentStoreImpl.java
        new revision: 1.44; previous revision: 1.43
        done
        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/dialogmgmt/ReplicationDialogFragmentManager.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/dialogmgmt/ReplicationDialogFragmentManager.java,v
        <-- ReplicationDialogFragmentManager.java
        new revision: 1.92; previous revision: 1.91
        done
        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/ServletTimerStoreImpl.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/ServletTimerStoreImpl.java,v
        <-- ServletTimerStoreImpl.java
        new revision: 1.61; previous revision: 1.60
        done
        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipApplicationSessionStoreImpl.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipApplicationSessionStoreImpl.java,v
        <-- SipApplicationSessionStoreImpl.java
        new revision: 1.71; previous revision: 1.70
        done
        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipSessionStoreImpl.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipSessionStoreImpl.java,v
        <-- SipSessionStoreImpl.java
        new revision: 1.57; previous revision: 1.56
        done
        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipTransactionPersistentManager.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipTransactionPersistentManager.java,v
        <-- SipTransactionPersistentManager.java
        new revision: 1.185; previous revision: 1.184
        done

        Scott Oaks added a comment -

        Some processing leads to stale entries being served out of the dialog fragment
        and sip session caches. These are still not cleaned up correctly on restart.

        Scott Oaks added a comment -

        Note that the effect of the partial fix (that is, the bug left over after the
        first checkin) is no longer an OOM, but errors in accessing SAS attributes,
        which will be out of date. In PresenceServlet tests, that means lots of
        "Stored Cseq is different by more than 1 from the new msgCseq" messages.

        Fixed code for the session and DF:

        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/dialogmgmt/ReplicationDialogFragmentManager.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/dialogmgmt/ReplicationDialogFragmentManager.java,v
        <-- ReplicationDialogFragmentManager.java
        new revision: 1.95; previous revision: 1.94
        done
        Checking in
        replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipTransactionPersistentManager.java;
        /cvs/sailfin/replication/src/main/java/org/jvnet/glassfish/comms/replication/sessmgmt/SipTransactionPersistentManager.java,v
        <-- SipTransactionPersistentManager.java
        new revision: 1.187; previous revision: 1.186
        done
        Checking in
        appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/ExpatListHandler.java;
        /cvs/glassfish/appserv-core-ee/http-session-persistence/src/java/com/sun/enterprise/ee/web/sessmgmt/Attic/ExpatListHandler.java,v
        <-- ExpatListHandler.java
        new revision: 1.1.2.9; previous revision: 1.1.2.8
        done


          People

          • Assignee: Scott Oaks
          • Reporter: Bhavanishankar
          • Votes: 0
          • Watchers: 0