1. shoal
  2. SHOAL-74

potential to miss FAILURE_NOTIFICATION when multiple instances killed at same time


    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: current
    • Fix Version/s: 1.1
    • Component/s: GMS
    • Labels:
    • Environment:

      Operating System: All
      Platform: All

    • Issuezilla Id:


      The bug was uncovered during a code review: a FAILURE_NOTIFICATION could be
      missed when two or more instances are killed at the same time. (Note that,
      given the race between the node agent restarting a killed instance and the
      failure notification, only a test that kills the node agent first and then kills
      the instances can be assured of seeing a FAILURE_NOTIFICATION for each server
      instance killed. A node agent can restart a server instance before Shoal
      reports it as FAILED.)

      HealthMonitor.InDoubtPeerDetector.processCacheUpdate() iterates over all
      instances in the cluster, checking whether any are in doubt. If an instance is
      found to be in doubt, HealthMonitor.InDoubtPeerDetector.determineInDoubtPeers()
      notifies the FailureVerifier thread to process the current cache, looking for
      InDoubtPeers to verify which instances should have a FAILURE_NOTIFICATION sent.

      synchronized (verifierLock) {
          verifierLock.notify();
          LOG.log(Level.FINER, "Done Notifying FailureVerifier for " + entry.adv.getName());
      }

      The notification signal from the InDoubtPeerDetector thread to the FailureVerifier
      thread is the weak link in this bug. When multiple failures happen at once, the
      code as currently written acts on the first instance failure immediately. The
      InDoubtPeerDetector should instead iterate over all instances and, if one or more
      instances are in doubt, then notify the FailureVerifier thread to run
      over all instances in the cluster cache.
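
      The proposed detector behavior can be sketched as follows. This is a minimal
      illustration, not Shoal's actual code: the class InDoubtScanSketch, the State
      enum, the Map-based cache, and the notifyCount field are all hypothetical
      stand-ins. Only the verifierLock name is taken from the snippet above. The
      point is that the whole cache is scanned first and the FailureVerifier is
      signalled once, after the scan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InDoubtScanSketch {
    // Hypothetical stand-in for the health state held in the cluster cache.
    public enum State { ALIVE, IN_DOUBT }

    public final Object verifierLock = new Object();
    // Counts notify() calls so the behavior can be observed (illustration only).
    public int notifyCount = 0;

    // Scan every instance first, then signal the verifier once if any is in doubt.
    public List<String> scanAndNotify(Map<String, State> cache) {
        List<String> inDoubt = new ArrayList<>();
        for (Map.Entry<String, State> entry : cache.entrySet()) {
            if (entry.getValue() == State.IN_DOUBT) {
                inDoubt.add(entry.getKey());
            }
        }
        if (!inDoubt.isEmpty()) {
            synchronized (verifierLock) {
                notifyCount++;
                verifierLock.notify();
            }
        }
        return inDoubt;
    }

    public static void main(String[] args) {
        Map<String, State> cache = new LinkedHashMap<>();
        cache.put("instance1", State.IN_DOUBT);
        cache.put("instance2", State.ALIVE);
        cache.put("instance3", State.IN_DOUBT);

        InDoubtScanSketch detector = new InDoubtScanSketch();
        List<String> inDoubt = detector.scanAndNotify(cache);
        // prints: [instance1, instance3] notifications=1
        System.out.println(inDoubt + " notifications=" + detector.notifyCount);
    }
}
```

      Signalling once after the scan only helps if the FailureVerifier then walks
      the entire cache rather than a single entry, as the description above asks.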

      The bug could be that the InDoubtPeerDetector runs twice. The first run notifies
      the FailureVerifier to run on the instance cache, and it detects the first killed
      instance. The second time the InDoubtPeerDetector runs, it could notify the
      FailureVerifier while it is still verifying the first failure (with a
      snapshotted cache). The second notify to an already running FailureVerifier thread
      will have no effect, and the FAILURE_NOTIFICATION for the second killed server
      instance will not be sent until much later, when the next failure occurs or the
      client is shut down.
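
      One common way to avoid losing a notify() that arrives while the receiver is
      busy is to pair the lock with a "pending" flag and wait in a guarded loop. The
      sketch below assumes this approach; it is not the fix adopted in Shoal. The
      class VerifierSignalSketch, the verificationPending flag, and both method names
      are hypothetical. Only verifierLock comes from the snippet above.

```java
public class VerifierSignalSketch {
    private final Object verifierLock = new Object();
    // Hypothetical flag: remembers a request even if no thread is waiting yet.
    private boolean verificationPending = false;

    // Called by the detector side: record the request, then wake the verifier.
    public void requestVerification() {
        synchronized (verifierLock) {
            verificationPending = true;
            verifierLock.notify();
        }
    }

    // Called by the verifier side: returns true if a request was pending or
    // arrived within the timeout, false if the timeout elapsed with none.
    public boolean awaitRequest(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (verifierLock) {
            while (!verificationPending) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                verifierLock.wait(remaining);
            }
            verificationPending = false; // consume the request
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VerifierSignalSketch signal = new VerifierSignalSketch();
        // Two requests arrive while the verifier is busy (nobody is waiting).
        signal.requestVerification();
        signal.requestVerification();
        System.out.println(signal.awaitRequest(100)); // true: request was remembered
        System.out.println(signal.awaitRequest(100)); // false: nothing else pending
    }
}
```

      Requests that arrive during a verification pass coalesce into a single
      follow-up pass, which is sufficient only if each pass takes a fresh snapshot
      of the cache and checks every instance in it.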


        There are no comments yet on this issue.


          • Assignee:
            Joe Fialli
          • Votes:
            0
          • Watchers:
            1


            • Created: