SHOAL-38

HealthMonitoring support for hardware/network failures avoiding TCP timeouts

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: current
    • Fix Version/s: milestone 1
    • Component/s: GMS
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: OpenSolaris

    • Issuezilla Id:
      38

      Description

      With a hardware or network failure, the current JxtaMgmt provider's HealthMonitor
      blocks until the TCP timeout expires, which on certain systems can take as long
      as 10 minutes. We need a timeout-based mechanism that lets applications configure
      a timeout after which a TCP socket connection based liveness check terminates and
      the member is marked as failed. This is needed to provide robustness in the face
      of hardware failures.
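
      To illustrate the requested behavior, here is a minimal sketch of a liveness
      check bounded by a caller-supplied timeout. It is not the JxtaMgmt or JXTA
      implementation; the class name LivenessCheck and the method isReachable are
      hypothetical, and only standard java.net calls are used.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Shoal or JXTA.
final class LivenessCheck {

    // Returns true if a TCP connection to host:port is established within
    // timeoutMillis; returns false on timeout, refusal, or any other I/O error.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // connect() with an explicit timeout throws SocketTimeoutException
            // (an IOException) instead of waiting for the OS-level TCP timeout.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

      A caller would treat a false result as grounds for marking the member failed,
      rather than waiting out the operating system's TCP timeout.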

      The fix for this needs to come from JXTA for SailFin, as it is a critical
      requirement for Ericsson.

        Activity

        shreedhar_ganapathy added a comment -

        Sheetal has integrated a fix for this feature into the trunk. The feature allows
        health monitoring to report a failure when a failure-detection TCP connection
        has been blocked for a configured timeout (30 seconds by default).

        The behavior is configured using the FAILURE_DETECTION_TCP_RETRANSMIT_TIMEOUT and
        FAILURE_DETECTION_TCP_RETRANSMIT_PORT properties specified in
        ServiceProviderConfigurationKeys.java.

        The Javadoc for these properties is as follows:
        FAILURE_DETECTION_TCP_RETRANSMIT_PORT
        The value of this key is a port common to all cluster members, on which
        a socket connection is attempted when a particular instance's configured
        periodic heartbeats have been missed for the maximum number of retries.

        FAILURE_DETECTION_TCP_RETRANSMIT_TIMEOUT
        Maximum time that the health monitoring protocol blocks while waiting for a
        response to a reachability query.
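
        As a rough sketch of how an application might assemble these two settings,
        the example below builds a java.util.Properties object keyed by the property
        names quoted above. The class FailureDetectionConfig and its build method are
        hypothetical. Expressing the timeout in milliseconds is an assumption (this
        issue only states a 30 second default), and how the resulting properties are
        handed to Shoal is not covered here.

```java
import java.util.Properties;

// Hypothetical helper, not part of Shoal.
final class FailureDetectionConfig {

    // Key names as quoted from ServiceProviderConfigurationKeys.java in this issue.
    static final String TIMEOUT_KEY = "FAILURE_DETECTION_TCP_RETRANSMIT_TIMEOUT";
    static final String PORT_KEY = "FAILURE_DETECTION_TCP_RETRANSMIT_PORT";

    // Builds the two failure-detection settings after basic validation.
    // Assumption: the timeout is given in milliseconds.
    static Properties build(long timeoutMillis, int port) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port must be in 1..65535");
        }
        Properties props = new Properties();
        props.setProperty(TIMEOUT_KEY, Long.toString(timeoutMillis));
        props.setProperty(PORT_KEY, Integer.toString(port));
        return props;
    }
}
```

        For example, FailureDetectionConfig.build(30000L, 9000) would correspond to the
        30 second default with an arbitrarily chosen port of 9000; the same port must be
        used on every cluster member.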


          People

          • Assignee:
            hamada
            Reporter:
            sheetalv
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: