HandleJoinProblem

Troubleshooting the handleJoin() problem

 

The problem is that a new node cannot join a cluster. The error message in the log is

 

[ClientGmsImpl] handleJoin(myhost:1234) failed, retrying

 

This is often times going on for a long time, and the new node is blocked. Let me explain what goes on

conceptually when a client joins, and then focus on how we can tackle the problem.

 

The algorithm for joining a cluster is very simple, in pseudo code:

  - loop
      - find initial members (discovery)
      - if no responses:
          - become singleton group and break out of the loop
      - else:
          - determine the coordinator (oldest member) from the responses
          - send JOIN request to coordinator
          - wait for JOIN response
          - if JOIN response received:
               - install view and break out of the loop
          - else
               - sleep for 5 seconds and continue the loop

 

So there are actually 2 parts involved: discovery and group membership. To enable tracing,

we have to enable these 2 protocols to be traced (assuming discovery is PING):

log4j.logger.org.jgroups.protocols.PING=trace
log4j.logger.org.jgroups.protocols.pbcast.GMS=trace

The (log4j) example needs a root logger which has a file appender.

 

 

So let's look at the log4j output of the new node which joins and the coordinator node

which handles the JOIN request.

The coordinator in the example is 192.168.0.15:1059,

the joining node 192.168.0.15:1064.

 

First, 1064 sends out a discovery request which is received by 1059:

13:27:15,822 [DEBUG] org.jgroups.protocols.PING: received GET_MBRS_REQ from 192.168.0.15:1064, 
sending response [PING: type=GET_MBRS_RSP, arg=[own_addr=192.168.0.15:1059, coord_addr=192.168.0.15:1059, is_server=true]]

The response says that 1059 is the coordinator, so 1064 has to send the JOIN_REQ to 1059.

 

Now 1064 receives the discovery response sent by 1059, determines that 1059 is the coordinator and sends the JOIN_REQ to 1059:

13:27:16,844 [DEBUG] org.jgroups.protocols.PING: received GET_MBRS_RSP, rsp=[own_addr=192.168.0.15:1059, coord_addr=192.168.0.15:1059, is_server=true]
13:27:17,765 [DEBUG] org.jgroups.protocols.pbcast.GMS: initial_mbrs are [[own_addr=192.168.0.15:1059, coord_addr=192.168.0.15:1059, is_server=true]]
13:27:17,765 [DEBUG] org.jgroups.protocols.pbcast.GMS: election results: {192.168.0.15:1059=1}
13:27:17,765 [DEBUG] org.jgroups.protocols.pbcast.GMS: sending handleJoin(192.168.0.15:1064) to 192.168.0.15:1059

 

 

Now the coordinator at 1059 receives the JOIN_REQ from 1064, adds 1064 to the current group,

and then sends a JOIN_RSP to 1064:

13:27:17,825 [DEBUG] org.jgroups.protocols.pbcast.GMS: processing JOIN(192.168.0.15:1064)
13:27:17,835 [DEBUG] org.jgroups.protocols.pbcast.GMS: joined member 192.168.0.15:1064, view is [192.168.0.15:1059|1] [192.168.0.15:1059, 192.168.0.15:1064]

 

The new view (

[192.168.0.15:1059|1] [192.168.0.15:1059, 192.168.0.15:1064]

)

shows that both 1059 and 1064 are now members, 1059 is the coordinator and this

is the second view (1, as views start at 0).

 

The new member now receives the JOIN_RSP, including the view. This causes it to move from

client to participant role:

13:27:17,915 [DEBUG] org.jgroups.protocols.pbcast.GMS: [192.168.0.15:1064]: JoinRsp=[192.168.0.15:1059|1] [192.168.0.15:1059, 192.168.0.15:1064] [size=2]
13:27:18,005 [DEBUG] org.jgroups.protocols.pbcast.GMS: 192.168.0.15:1064 changed role to org.jgroups.protocols.pbcast.ParticipantGmsImpl

 

 

As we can see, the time between sending the JOIN_REQ to the coordinator and receiving

the JOIN_RSP is roughly 170ms.

 

How to use the logs to troubleshoot

The following steps should be taken to narrow down the problem:

  • Find out whom the client sends the JOIN_REQ to (in this case 1059)

  • Check the logs on the coordinator 1059 to see whether the JOIN_REQ from 1064 was received

    • If yes: check whether a new view was created and sent back to 1064. Also check on 1064 whether the JOIN_RSP was received

    • If no: enable full tracing, to follow the JOIN_REQ (wrapped in a JGroups Message) down the stack at the client, up the stack on the coordinator, then follow the JOIN_RSP down the stack at the coordinator, and up the stack at the client. If any of those 2 segments of network traffic fail, there might be a network problem. For a typical UDP:PING stack, full tracing doesn't need to be enabled, but simply tracing for UDP and UNICAST.

 

Additional tips

  • State the version of JGroups you are using

  • If possible, always use the latest released version of JGroups

  • Attach the JGroups configuration you use

  • Try to reproduce the problem with JGroups standalone first, only then run JBoss

  • Take a stack trace on the coordinator when this problem occurs to see if the coordinator is hanging for some other reason, and cannot process the JOIN_REQ

 

Separation of clusters

If clusters are not cleanly separated, the following problem could occur:

 

CL1 uses partition name DefaultPartition, mcast_addr=228.1.2.3 and mcast_port=77777.

 

CL2 uses partition name DefaultPartition, mcast_addr=228.1.2.3 and mcast_port=88888.

 

 

Apart from having almost the same transport configuration, CL1 and CL2 have different configurations,

e.g. CL1 uses encyption and compression, whereas CL2 doesn't.

Due to a bug in Linux, the follwing could now happen:

  • The first member in CL1 is started (A)

  • The first member in CL2 is started (B)

  • B does a discovery and gets a response from A. This should not happen, but does because of the Linux bug and the non-existent separation of CL1 and CL2 (only the port is different)

  • Instead of becoming coordinator, B tries to send a JOIN_REQ to A

  • A receives the request, but doesn't 'understand' it because it comes from a member with a different stack

  • The JOIN_REQ on A is never propagated up the stack to the GMS to handle it

  • The client (B) never gets a response and continues sending JOIN_REQs to A

 

Reincarnation

The handleJoin() issue can also be caused by reincarnation. Reincarnation occurs when a member crashes and is restarted under the same IP address and port. Since the address:port pair forms the identity of a JGroups member (this will get changed with logical addresses, if we introduce them), JGroups cannot differentiate between a reincarnated member and an existing member if failure detection didn't exclude the crashed member before it rejoined the cluster.

 

Let's look at an example: assume the group {A,B,C}

  • A, B and C send messages to each other

  • A has received 25 unicast messages from C. This information is maintained in UNICAST (if UDP is used)

  • The stack has FD with a timeout of 20 seconds, but no FD_SOCK

  • C crashes and is restarted right away. Had 20 seconds passed before restarting C, A and B would have received a new view {A,B} and all state pertaining to C would have been removed, e.g. from UNICAST, STABLE and NAKACK. Then C could have rejoined as a new member

  • However, before C is excluded, it is restarted and sends a unicast JOIN request with sequence number (seqno) 1 to A

  • A hasn't yet cleared its tables of the state of C, therefore it expects the next seqno from C to be 25, and thus discards the JOIN request

  • C continues sending JOIN requests to A, until its seqno is 26, and only then will the JOIN request get accepted !

  • Unfortunately, by C sending JOIN requests to A, A does not suspect C, as traffic from C counts as heartbeat acks and therefore the C will not get suspected as long as it continues sending JOIN requests to A

 

What can be done to solve this problem ?

 

The first solution is to add FD_SOCK to the stack. This helps detecting certain types of crashes sooner, and so members are able to remove state pertaining to crashed members right away.

 

However, FD_SOCK does not detect crashes caused by kernel panic (as the TCP sockets are not closed), or crashes of switches. Therefore, the second solution is to remember the port numbers used on a given host for some time before they can be reused.

 

This is done by storing port numbers in a file in temporary storage (by default in the temp directory in a file jgroups-ports.txt). Whenever a new port is opened, the port number is written to that file. Before the port is opened, the file is consulted for available ports. After some time, older ports are removed from the file.

 

To configure JGroups to do this, the following properties can be set in any transport (subclass of TP, e.g. UDP, TCP, TCP-NIO):

  • persistent_ports (default: false): whether or not to maintain the list of recently used ports in stable storage

  • persistent_ports_file (default: java.io.tmpdir/jgroups-ports.txt): the fully qualified filename

  • ports_expiry_time (default: 30000): the time in ms after which ports are removed and can subsequently be reused

 

The JIRA issue for this is http://jira.jboss.com/jira/browse/JGRP-637.

 

Reincarnation when using UDP without bind_port

In this case, ephemeral ports should be allocated by the OS. However, I found that on Linux this is not the case (supposedly fixed after 2.6.22.x). If processes are killed and immediately restarted, and the OS reuses the previous port, then we might also run into the reincarnation issue.

In this case, we can set UDP.prevent_port_reuse="true" and we will force the allocation of a new port.