Sending of messages in viewAccepted() callback

Sending messages from within a viewAccepted() callback is generally a bad idea, because it can lead to unexpected behavior.

 

Here's what happens. Let's say we have a cluster {A,B,C} and a new joiner D.

 

D sends a join request to A (the coordinator) and A then multicasts the new view V={A,B,C,D} to {A,B,C} and also sends V to D.

 

At time T1, B receives V and thus its viewAccepted() callback is invoked. B now multicasts message M from within the callback.

 

At T2, D receives M, but discards it because D's view hasn't been set yet.

 

At T3, D receives V.

 

B's message M has therefore not been delivered to D. D will only get M through retransmission, which requires either another message from B or the stable task kicking in.

 

The same applies to D sending a message M if D receives V before any of the other nodes: all nodes that haven't received V yet will discard M, and M will have to be retransmitted later.
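
The timeline above can be replayed with a small toy model. This is not JGroups code: ToyNode and ViewRaceDemo are hypothetical names, and the only rule modelled is the one described above, namely that a node with no view yet discards whatever it receives. Retransmission is simulated by simply handing M to the node a second time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only (not JGroups code): a node discards messages until a view is installed.
class ToyNode {
    private Set<String> view;                       // null until the first view arrives
    final List<String> delivered = new ArrayList<>();

    void installView(Set<String> newView) {
        view = newView;
    }

    // Returns true if the message was delivered, false if it was discarded.
    boolean receive(String sender, String msg) {
        if (view == null) {
            return false;                           // no view set yet: drop the message
        }
        delivered.add(sender + ":" + msg);
        return true;
    }
}

class ViewRaceDemo {
    public static void main(String[] args) {
        ToyNode d = new ToyNode();

        boolean atT2 = d.receive("B", "M");             // T2: M arrives before V
        d.installView(Set.of("A", "B", "C", "D"));      // T3: V arrives
        boolean retransmitted = d.receive("B", "M");    // later: M is retransmitted

        System.out.println("delivered at T2: " + atT2);
        System.out.println("delivered after retransmission: " + retransmitted);
    }
}
```

Running ViewRaceDemo prints `delivered at T2: false` followed by `delivered after retransmission: true`, which mirrors the gap between T2 and the eventual retransmission.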

 

If the application can tolerate M being delivered only after the sender sends another message or stability kicks in, then this is fine and the app should be OK.

 

If message M is a blocking RPC, then the RPC will block until all requests have been delivered and the responses sent back to the invoker. However, some of the requests might take some time to get delivered and thus the RPC will block for that time.

 

Problems with FLUSH

 

There's another issue if messages are sent on the same thread that invoked viewAccepted(): if FLUSH is on the stack, everybody is blocked from sending messages until the flush has completed.

 

What happens during a flush is that, when a view is to be installed in a cluster, the cluster nodes are flushed first. This means that FLUSH runs a two-phase protocol which makes sure that all nodes in view V1 have delivered the same set of messages in V1. Then everybody is blocked from sending new messages, and view V2 is installed. After V2 has been installed across the cluster, everybody is unblocked again.

 

This means that any message sent in viewAccepted() will block when FLUSH is present.

 

This is bad because viewAccepted() is invoked by a JGroups thread, and that thread needs to return from viewAccepted() to complete the flush process. However, it can't return if the message sent in viewAccepted() blocks. Thus we have a deadlock.
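
The circular wait can be illustrated with a toy model. This is a sketch under assumptions, not how FLUSH is implemented: ToyFlushGate is a hypothetical class in which sends wait on a latch that is only released after the view callback has returned. A timeout is used so the demo terminates; with an unbounded wait, the send inside the callback would never return.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model only (not the FLUSH implementation): sends are blocked until
// the view callback has returned and the gate has been opened.
class ToyFlushGate {
    private final CountDownLatch unblocked = new CountDownLatch(1);

    // Models a blocked send: waits for the gate to open, gives up after timeoutMs.
    // The message itself is ignored; only the blocking behavior is modelled.
    boolean send(String msg, long timeoutMs) {
        try {
            return unblocked.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Models the protocol thread: the gate opens only after the callback returns.
    void installView(Runnable viewAccepted) {
        viewAccepted.run();
        unblocked.countDown();
    }
}

class FlushDeadlockDemo {
    public static void main(String[] args) {
        ToyFlushGate gate = new ToyFlushGate();
        boolean[] sentInCallback = new boolean[1];

        // The send waits for the gate, but the gate waits for this callback to return.
        gate.installView(() -> sentInCallback[0] = gate.send("M", 200));

        System.out.println("send inside callback completed: " + sentInCallback[0]);
        System.out.println("send after callback completed: " + gate.send("M", 200));
    }
}
```

Running FlushDeadlockDemo prints `false` for the send inside the callback (it timed out) and `true` for the send made after the callback returned.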

 

The recommendation is to (a) avoid sending messages in viewAccepted() and (b) if messages need to be sent, they should be sent in a separate thread (or timer task).
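
Option (b) could look roughly like the following sketch. It deliberately avoids the JGroups API so that it runs on its own: MessageSender is a hypothetical stand-in for whatever performs the real send (for example a wrapper around the channel), and viewAccepted() here takes a plain list of member names rather than a JGroups View. The point is only that the callback queues the work and returns immediately.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the component that actually sends a message.
interface MessageSender {
    void send(String payload) throws Exception;
}

class ViewListener {
    private final MessageSender sender;
    // A single worker thread keeps the queued sends in submission order.
    private final ExecutorService sendPool = Executors.newSingleThreadExecutor();

    ViewListener(MessageSender sender) {
        this.sender = sender;
    }

    // Simplified callback: only hands the send to the worker thread, then returns.
    void viewAccepted(List<String> members) {
        String payload = "hello " + members;
        sendPool.execute(() -> {
            try {
                sender.send(payload);       // runs on the worker, not on the caller's thread
            } catch (Exception e) {
                System.err.println("send failed: " + e);
            }
        });
    }

    // Stops accepting work and waits for queued sends; false on timeout or interrupt.
    boolean shutdownAndWait(long timeoutMs) {
        sendPool.shutdown();
        try {
            return sendPool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class OffloadDemo {
    public static void main(String[] args) {
        ViewListener listener = new ViewListener(payload ->
                System.out.println(Thread.currentThread().getName() + " sends: " + payload));
        listener.viewAccepted(List.of("A", "B", "C", "D"));
        listener.shutdownAndWait(2000);
    }
}
```

Because the send runs on the worker thread, the thread that invoked the callback is free to return. Note that this only addresses the blocking problem: a message sent this way can still be discarded by members that have not yet received the new view.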