sailfin
  1. sailfin
  2. SAILFIN-1031

[blocking] An instance's Transient Failure can result in a request being processed by it before it is ready

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: b43
    • Component/s: load_balancer
    • Labels:
      None
    • Environment:

      Operating System: Solaris SPARC
      Platform: Sun

      Description

      Sailfin Build 40

      Transient Failure Scenario:

      (Using the term Transient Failure to mean - Killing an instance without killing
      its Node Agent. As a result, the NA starts back the instance immediately.)

      • Send INVITE
      • Receive 200 OK
      • Send ACK
      • Kill serving instance process (do NOT kill Node Agent - and so - Node Agent
        restarts the instance)
      • Send BYE within 5 seconds of killing the serving instance
      • Continue to re-send BYE if a 503 is received until 200 OK

      Setup for the logs attached:

      • Cluster with 5 instances - instance101, instance102, instance103, instance104,
        instance105
      • All requests are initially sent to instance101 which then forwards it to the
        appropriate instance.
      • NOTE: In this run instance105 was the serving instance.
      • After killing instance105, the SIP client continues to re-send BYE to
        instance101 and instance101 continues to try to send the BYE (and its re-sends)
        to instance102 and gets Connection Refused and correctly responds to the SIP
        client with a 503. This should happen until instance105 starts back up.

      Behaviour:

      • When the Node Agent restarts instance105, CLB at instance105 finds that
        instance105 is unhealthy and so updates the Actual_Hash table to remove it.
      • In the meantime the SIP client continues to re-send BYE and instance101
        continues to try and forward BYE to instance102. instance101 continues to
        respond with 503 failure until it is actually able to forward the request to
        instance102.
      • Now, it looks like CLB_ActualHash is next updated (and instance105 added to
        it) only on GMS JOINED_AND_READY_EVENT for instance105 i.e. after instance105
        starts back up.
      • What seems to happen is that before CLB_ActualHash is updated instance101 is
        able to forward the request to instance105 (perhaps because the
        SIP-listener/proxy at instance105 is up)
      • Now instance105s Actual_Hash does not contain itself in it and so it forwards
        the BYE (to instance102 in this case). instance102 forwards it back to
        instance105 and the loop continues.
      • At this point either of two things happen:
        a) The SIP message grows larger due the looping and grizzly or sip module is
        unable to parse it anymore [will check if another issue needs to be filed for this.]
        b) instance105 completes to startup before this parse error, the looping
        continues until the GMS JOINED_AND_READY_EVENT is seen, CLB_ActualHash is
        updated with instance105 and the request gets processed correctly with a 200 OK.

      Issue:

      • Requests need to be accepted only after CLB_ActualHash is updated and the
        instance can begin to process requests.

      Log Messages of Interest:

      A) CLB reconfiguration snippets when the NA restarts the serving instance105 -

      • first below is the snippet indicating instance105 being found unhealthy.
      • this is followed by snippet indicating update of the CLB_ActualHash table.
        ************
        [#|2008-07-08T04:28:11.273-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org.j
        vnet.glassfish.comms.clb.core.ClusterHealthMonitor;MethodName=getHealthyInstances;st-cluster;[instance103,
        instance101, instance104, instance10
        5, instance102];_RequestID=ad3f8996-d542-4b86-9ae5-42751b44fcfc;|CLBC0081: Core
        members of GMS module for cluster st-cluster : [instance103, in
        stance101, instance104, instance105, instance102]|#]

      [#|2008-07-08T04:28:11.273-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org.j
      vnet.glassfish.comms.clb.core.ClusterHealthMonitor;MethodName=getHealthyInstances;st-cluster;[instance103,
      instance101, instance104, instance10
      5, instance102];_RequestID=ad3f8996-d542-4b86-9ae5-42751b44fcfc;|CLBC0082:
      Current core members of GMS module for cluster st-cluster : [instanc
      e103, instance101, instance104, instance105, instance102]|#]

      [#|2008-07-08T04:28:13.481-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org.j
      vnet.glassfish.comms.clb.core.ServerCluster;MethodName=markInstancesHealthy;instance103;st-cluster;_RequestID=ad3f8996-d542-4b86-9ae5-42751b44f
      cfc;|CLBC0064: Marking instance instance103 in cluster st-cluster as healthy|#]

      [#|2008-07-08T04:28:13.482-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org.j
      vnet.glassfish.comms.clb.core.ServerCluster;MethodName=markInstancesHealthy;instance101;st-cluster;_RequestID=ad3f8996-d542-4b86-9ae5-42751b44f
      cfc;|CLBC0064: Marking instance instance101 in cluster st-cluster as healthy|#]

      [#|2008-07-08T04:28:13.482-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org.j
      vnet.glassfish.comms.clb.core.ServerCluster;MethodName=markInstancesHealthy;instance104;st-cluster;_RequestID=ad3f8996-d542-4b86-9ae5-42751b44f
      cfc;|CLBC0064: Marking instance instance104 in cluster st-cluster as healthy|#]

      [#|2008-07-08T04:28:13.483-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org.j
      vnet.glassfish.comms.clb.core.ServerCluster;MethodName=markInstancesHealthy;instance102;st-cluster;_RequestID=ad3f8996-d542-4b86-9ae5-42751b44f
      cfc;|CLBC0064: Marking instance instance102 in cluster st-cluster as healthy|#]

      ***********

      [#|2008-07-08T04:28:14.028-0700|FINEST|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=10;_ThreadName=main;ClassName=org
      .jvnet.glassfish.comms.clb.core.util.ConsistentHash;MethodName=setup;_RequestID=ad3f8996-d542-4b86-9ae5-42751b44fcfc;|Setup
      done: Current state
      :
      org.jvnet.glassfish.comms.clb.core.util.ConsistentHash@e3f6d(CLB_ActualHash)=[st-cluster/instance101,
      st-cluster/instance102, st-cluster/inst
      ance103, st-cluster/instance104]|#]

      **********

      B) instance105 starts back up, GMS JOINED_AND_READY_EVENT seen and
      CLB_ActualHash is updated (and instance105 added to it)
      **********

      [#|2008-07-08T04:28:37.816-0700|INFO|sun-comms-appserver1.0|ShoalLogger|_ThreadID=13;_ThreadName=ViewWindowThread;JOINED_AND_READY_EVENT;|Analy
      zing new membership snapshot received as part of event : JOINED_AND_READY_EVENT|#]

      [#|2008-07-08T04:28:37.817-0700|INFO|sun-comms-appserver1.0|javax.enterprise.system.core.selfmanagement|_ThreadID=10;_ThreadName=main;|SMGT0007
      : Self Management Rules service is enabled|#]

      [#|2008-07-08T04:28:37.826-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Clas
      sName=org.jvnet.glassfish.comms.clb.core.ClusterHealthMonitor$CHMGMSEventListenerImpl;MethodName=onRecovery;instance105;st-cluster;_RequestID=d
      ffa5485-9146-4247-a5a0-8ab81b5e2235;|CLBC0083: Recieved recovery notification
      for instance instance105 of cluster st-cluster|#]

      [#|2008-07-08T04:28:37.826-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Clas
      sName=org.jvnet.glassfish.comms.clb.core.ServerInstance;MethodName=markAsHealthy;instance105;org.jvnet.glassfish.comms.clb.core.common.chr.Cons
      istentHashRouter@1660d22;_RequestID=dffa5485-9146-4247-a5a0-8ab81b5e2235;|CLBC0068:
      Sending recovery event for instance instance105 to router o
      rg.jvnet.glassfish.comms.clb.core.common.chr.ConsistentHashRouter@1660d22|#]

      [#|2008-07-08T04:28:37.827-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Clas
      sName=org.jvnet.glassfish.comms.clb.core.common.chr.ConsistentHashRouter;MethodName=addToActualConsistentHash;instance105;_RequestID=dffa5485-9
      146-4247-a5a0-8ab81b5e2235;|CLBC0109: Adding instance instance105 to active list.|#]

      [#|2008-07-08T04:28:37.836-0700|FINEST|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Cl
      assName=org.jvnet.glassfish.comms.clb.core.util.ConsistentHash;MethodName=setup;_RequestID=dffa5485-9146-4247-a5a0-8ab81b5e2235;|Setup
      done: Cu
      rrent state:
      org.jvnet.glassfish.comms.clb.core.util.ConsistentHash@e3f6d(CLB_ActualHash)=[st-cluster/instance105,
      st-cluster/instance101, st-c
      luster/instance102, st-cluster/instance103, st-cluster/instance104]|#]

      [#|2008-07-08T04:28:37.836-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Clas
      sName=org.jvnet.glassfish.comms.clb.core.ServerInstance;MethodName=markAsHealthy;instance105;org.jvnet.glassfish.comms.clb.core.common.chr.Stic
      kyConsistentHashRouter@85968b;_RequestID=dffa5485-9146-4247-a5a0-8ab81b5e2235;|CLBC0068:
      Sending recovery event for instance instance105 to rou
      ter
      org.jvnet.glassfish.comms.clb.core.common.chr.StickyConsistentHashRouter@85968b|#]

      [#|2008-07-08T04:28:37.836-0700|FINE|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Clas
      sName=org.jvnet.glassfish.comms.clb.core.common.chr.ConsistentHashRouter;MethodName=addToActualConsistentHash;instance105;_RequestID=dffa5485-9
      146-4247-a5a0-8ab81b5e2235;|CLBC0109: Adding instance instance105 to active list.|#]

      [#|2008-07-08T04:28:37.845-0700|FINEST|sun-comms-appserver1.0|javax.enterprise.system.container.clb|_ThreadID=28;_ThreadName=pool-2-thread-3;Cl
      assName=org.jvnet.glassfish.comms.clb.core.util.ConsistentHash;MethodName=setup;_RequestID=dffa5485-9146-4247-a5a0-8ab81b5e2235;|Setup
      done: Cu
      rrent state:
      org.jvnet.glassfish.comms.clb.core.util.ConsistentHash@10e130c(CLB_ActualHash)=[st-cluster/instance105,
      st-cluster/instance101, st
      -cluster/instance102, st-cluster/instance103, st-cluster/instance104]|#]

      **********

        Activity

        Hide
        varunrupela added a comment -

        Created an attachment (id=548)
        attaching logs

        Show
        varunrupela added a comment - Created an attachment (id=548) attaching logs
        Hide
        sanandal added a comment -

        Marked keyword as shark-na

        Show
        sanandal added a comment - Marked keyword as shark-na
        Hide
        pankaj_jairath added a comment -

        Discussed the fix required and reassigning to Kshitiz.

        Show
        pankaj_jairath added a comment - Discussed the fix required and reassigning to Kshitiz.
        Hide
        varunrupela added a comment -

        This issue blocks a session replication test from completing.

        Show
        varunrupela added a comment - This issue blocks a session replication test from completing.
        Hide
        kshitiz_saxena added a comment -

        Two checks are added to stop looping
        1. Upon restart of the instance from transient failure :
        a) Requests sprayed to the instance such that they are selected to be
        serviced on this instance would be allowed to go up the sip stack into the
        container.
        b) Requests proxied from FE to BE (recovering instance) would be allowed.
        2. For sip incoming request, if "felb" is found in topmost via, request is just
        passed to next layer. There is no further processing done by CLB layer.

        Checkin logs :
        Checking in org/jvnet/glassfish/comms/clb/core/CLBConstants.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/CLBConstants.java,v
        <-- CLBConstants.java
        new revision: 1.10; previous revision: 1.9
        done
        Checking in org/jvnet/glassfish/comms/clb/core/ProxyKeyExtractor.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/ProxyKeyExtractor.java,v
        <-- ProxyKeyExtractor.java
        new revision: 1.6; previous revision: 1.5
        done
        Checking in org/jvnet/glassfish/comms/clb/core/Router.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/Router.java,v
        <-- Router.java
        new revision: 1.8; previous revision: 1.7
        done
        Checking in org/jvnet/glassfish/comms/clb/core/ServerInstance.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/ServerInstance.java,v
        <-- ServerInstance.java
        new revision: 1.15; previous revision: 1.14
        done
        Checking in org/jvnet/glassfish/comms/clb/core/StickyRoundRobinRouter.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/StickyRoundRobinRouter.java,v
        <-- StickyRoundRobinRouter.java
        new revision: 1.10; previous revision: 1.9
        done
        Checking in
        org/jvnet/glassfish/comms/clb/core/common/chr/StickyConsistentHashRouter.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/common/chr/StickyConsistentHashRouter.java,v
        <-- StickyConsistentHashRouter.java
        new revision: 1.9; previous revision: 1.8
        done
        Checking in
        org/jvnet/glassfish/comms/clb/core/common/chr/StickyHashKeyExtractor.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/common/chr/StickyHashKeyExtractor.java,v
        <-- StickyHashKeyExtractor.java
        new revision: 1.15; previous revision: 1.14
        done
        Checking in org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerBackend.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerBackend.java,v
        <-- SipLoadBalancerBackend.java
        new revision: 1.27; previous revision: 1.26
        done
        Checking in
        org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerIncomingHandler.java;
        /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerIncomingHandler.java,v
        <-- SipLoadBalancerIncomingHandler.java
        new revision: 1.25; previous revision: 1.24
        done

        Show
        kshitiz_saxena added a comment - Two checks are added to stop looping 1. Upon restart of the instance from transient failure : a) Requests sprayed to the instance such that they are selected to be serviced on this instance would be allowed to go up the sip stack into the container. b) Requests proxied from FE to BE (recovering instance) would be allowed. 2. For sip incoming request, if "felb" is found in topmost via, request is just passed to next layer. There is no further processing done by CLB layer. Checkin logs : Checking in org/jvnet/glassfish/comms/clb/core/CLBConstants.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/CLBConstants.java,v <-- CLBConstants.java new revision: 1.10; previous revision: 1.9 done Checking in org/jvnet/glassfish/comms/clb/core/ProxyKeyExtractor.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/ProxyKeyExtractor.java,v <-- ProxyKeyExtractor.java new revision: 1.6; previous revision: 1.5 done Checking in org/jvnet/glassfish/comms/clb/core/Router.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/Router.java,v <-- Router.java new revision: 1.8; previous revision: 1.7 done Checking in org/jvnet/glassfish/comms/clb/core/ServerInstance.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/ServerInstance.java,v <-- ServerInstance.java new revision: 1.15; previous revision: 1.14 done Checking in org/jvnet/glassfish/comms/clb/core/StickyRoundRobinRouter.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/StickyRoundRobinRouter.java,v <-- StickyRoundRobinRouter.java new revision: 1.10; previous revision: 1.9 done Checking in org/jvnet/glassfish/comms/clb/core/common/chr/StickyConsistentHashRouter.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/common/chr/StickyConsistentHashRouter.java,v <-- StickyConsistentHashRouter.java new revision: 1.9; previous revision: 1.8 done Checking in org/jvnet/glassfish/comms/clb/core/common/chr/StickyHashKeyExtractor.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/common/chr/StickyHashKeyExtractor.java,v <-- StickyHashKeyExtractor.java new revision: 1.15; previous revision: 1.14 done Checking in org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerBackend.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerBackend.java,v <-- SipLoadBalancerBackend.java new revision: 1.27; previous revision: 1.26 done Checking in org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerIncomingHandler.java; /cvs/sailfin/clb/src/main/java/org/jvnet/glassfish/comms/clb/core/sip/SipLoadBalancerIncomingHandler.java,v <-- SipLoadBalancerIncomingHandler.java new revision: 1.25; previous revision: 1.24 done

          People

          • Assignee:
            kshitiz_saxena
            Reporter:
            varunrupela
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: