
    Definition

     

    Failure detection based on heartbeat messages. A member sends 'are-you-alive' messages with a periodicity of 'timeout' milliseconds. After the first missing heartbeat response, the initiating member sends 'max_tries' more heartbeat messages, and the target member is declared suspect only after all of these go unanswered.

     

    In the worst case, when the target member dies immediately after answering a heartbeat, the failure takes timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout milliseconds to detect.
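    The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names (FdTiming, worstCaseDetectionMillis) are made up for illustration and are not part of JGroups. It simply evaluates (max_tries + 2) * timeout.

```java
// Sketch only: FdTiming and worstCaseDetectionMillis are hypothetical names,
// not part of the JGroups API.
final class FdTiming {
    private FdTiming() {}

    /**
     * Worst-case detection time in milliseconds, following the formula
     * timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout.
     */
    static long worstCaseDetectionMillis(long timeoutMillis, int maxTries) {
        if (timeoutMillis <= 0 || maxTries < 0) {
            throw new IllegalArgumentException("timeout must be > 0 and max_tries >= 0");
        }
        return (maxTries + 2L) * timeoutMillis;
    }

    public static void main(String[] args) {
        // timeout=2000 and max_tries=3: (3 + 2) * 2000 = 10000
        System.out.println(worstCaseDetectionMillis(2000, 3) + " ms");
    }
}
```

    With timeout="2000" and max_tries="3", the worst case is therefore 10000 milliseconds.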

     

    Once a member is suspected, it is excluded by GMS. SUSPECT event handling is also subject to interaction with VERIFY_SUSPECT. If FD_SOCK is used instead, no heartbeats are sent; members establish TCP sockets and declare a member dead only when a socket is closed.

     

     

    Configuration Example

     

        <FD timeout="2000" max_tries="3" shun="true"></FD>
    

     

    Configuration Parameters


    Name         Description
    id           Give the protocol a different ID if needed so we can have multiple instances of it in the same stack
    level        Sets the logger level (see javadocs)
    max_tries    Number of times to send an are-you-alive message
    name         Give the protocol a different name if needed so we can have multiple instances of it in the same stack
    stats        Determines whether to collect statistics (and expose them via JMX). Default is true
    timeout      Timeout to suspect a node P if neither a heartbeat nor data were received from P. Default is 3000 msec

     

    See also Protocol Configuration Common Parameters.

     

     

     

    Advanced

     

    Each member sends a message containing a "FD" - HEARTBEAT header to its neighbor to the right (identified by its address). The heartbeats are sent by an inner class of the protocol.
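    To make "neighbor to the right" concrete, here is a small sketch. It assumes that the ring order is simply the order of the member list and that the last member wraps around to the first, which is consistent with the A, B, C, D example further down this page (A pings B, B pings C, C pings D and D pings A). RingNeighbor is a hypothetical helper, not a JGroups class, and plain strings stand in for member addresses.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: RingNeighbor is a hypothetical helper, not a JGroups class.
final class RingNeighbor {
    private RingNeighbor() {}

    /**
     * Returns the member to the right of 'self' in the member list,
     * wrapping around at the end. Returns null if 'self' is not in the
     * list or if there is no other member to ping.
     */
    static String neighborToTheRight(List<String> members, String self) {
        int index = members.indexOf(self);
        if (index < 0 || members.size() < 2) {
            return null;
        }
        return members.get((index + 1) % members.size());
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("A", "B", "C", "D");
        for (String member : members) {
            System.out.println(member + " pings " + neighborToTheRight(members, member));
        }
    }
}
```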

     

    When the neighbor receives the HEARTBEAT, it replies with a message containing a "FD" - HEARTBEAT_ACK header. The first member watches for "FD" - HEARTBEAT_ACK replies from its neighbor. For each received reply, it resets a timestamp (sets it to the current time) and a counter (sets it to 0).

     

    The same instance that sends heartbeats watches the difference between the current time and the timestamp of the last reply. If this difference grows beyond timeout, it cycles several more times (until max_tries is reached) and then sends a SUSPECT message for the neighbor's address. The SUSPECT message is sent down the stack, is addressed to all members, and is sent as a regular message with a corresponding header.
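    The bookkeeping described above can be sketched as a small state machine. This is not the FD source code: the class name, field names and the exact way cycles are counted are assumptions made for illustration. Time is passed in explicitly so the logic can be followed without real timers or a network.

```java
// Sketch only: HeartbeatMonitorSketch and its members are hypothetical names;
// this models the description on this page, not the actual JGroups FD class.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final int maxTries;
    private long lastAckMillis; // time of the last heartbeat ack
    private int numTries;       // unanswered cycles since the ack became stale

    HeartbeatMonitorSketch(long timeoutMillis, int maxTries, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.maxTries = maxTries;
        this.lastAckMillis = nowMillis;
    }

    /** A heartbeat ack arrived: reset the timestamp and the counter. */
    void onAck(long nowMillis) {
        lastAckMillis = nowMillis;
        numTries = 0;
    }

    /**
     * Called once per heartbeat cycle (every 'timeout' milliseconds).
     * Returns true when the neighbor should be suspected.
     */
    boolean onCycle(long nowMillis) {
        if (nowMillis - lastAckMillis <= timeoutMillis) {
            return false; // the last ack is recent enough
        }
        if (numTries >= maxTries) {
            return true;  // all extra heartbeats went unanswered
        }
        numTries++;       // try again on the next cycle
        return false;
    }

    public static void main(String[] args) {
        // The neighbor acks at time 0 and then never answers again.
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(2000, 3, 0);
        for (long now = 2000; ; now += 2000) {
            if (monitor.onCycle(now)) {
                System.out.println("neighbor suspected at " + now + " ms");
                break;
            }
        }
    }
}
```

    With timeout=2000 and max_tries=3, and no ack after time 0, this sketch reports the suspicion at 10000 ms, which agrees with the (max_tries + 2) * timeout worst case given in the definition.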

     

     

    Cause of missing heartbeats in FD

     

    Sometimes a member is suspected by FD because a heartbeat ack has not been received for some time T (defined by timeout and max_tries). This can have several causes, e.g. in a cluster of A,B,C,D; C can be suspected if (note that A pings B, B pings C, C pings D and D pings A):

    • B or C are running at 100% CPU for more than T seconds. So even if C sends a heartbeat ack to B, B may not be able to process it because it is at 100%

    • B or C are garbage collecting, same as above.

    • A combination of the 2 cases above

    • The network loses packets. This usually happens when there is a lot of traffic on the network, and the switch starts dropping packets (usually broadcasts first, then IP multicasts, TCP packets last).

    • B or C are processing a callback. Let's say C received a remote method call (e.g. via RpcDispatcher), and takes T+1 seconds to process it. During this time, C will not process any other messages, including heartbeats, and therefore B will not receive the heartbeat ack and will suspect C. This will change in JGroups 2.5 with the threadless stack, out-of-band messages and priority messages. As a workaround for the time being, consider running long tasks in a callback on a separate thread.
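    The workaround from the last item, running long tasks on a separate thread, might look roughly like the following. This is a generic sketch that uses only java.util.concurrent; OffloadingHandler and handleRequest are hypothetical names standing in for whatever callback the application registers, and the pool size of 4 is arbitrary. The point is that the callback returns quickly, so the thread that delivers messages is free to process heartbeats.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch only: OffloadingHandler and handleRequest are hypothetical names,
// not part of the JGroups API.
final class OffloadingHandler {
    // Worker threads that run the long tasks (pool size chosen arbitrarily).
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    /** Returns quickly; the long-running work happens on a worker thread. */
    Future<String> handleRequest(final String request) {
        Callable<String> task = () -> slowTask(request);
        return workers.submit(task);
    }

    private static String slowTask(String request) throws InterruptedException {
        Thread.sleep(200); // stands in for a long computation
        return "done: " + request;
    }

    /** Stops accepting work and waits briefly for running tasks to finish. */
    void shutdown() throws InterruptedException {
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        OffloadingHandler handler = new OffloadingHandler();
        Future<String> result = handler.handleRequest("job-1"); // does not block
        System.out.println(result.get()); // the caller decides when to wait
        handler.shutdown();
    }
}
```

    Whether the caller can return before the result is ready depends on the application; if the remote caller needs the result as the return value of the call, this sketch alone is not sufficient.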

     

     

     

    For more details refer to Failure Detection

     
