DesignOfDynamicRehashing

Raison d'être: This design document attempts to clarify the otherwise extremely complex code of Infinispan's dynamic rehashing.  It pertains to the implementation in Infinispan 5.0 at this time, and must be updated with notes accordingly if and when changes to the design and implementation render it out of date.


WARNING: This is work in progress and should be considered a draft.  The information is accurate, but isn't well presented at the moment and will be updated in due course, with proper state transition diagrams and process flows.

Overview

Rehashing is the complex process by which a distributed cluster of Infinispan nodes is rebalanced; entries stored in the data grid are shuffled around the new consistent hash wheel for even distribution.  Some key considerations of this design are:


  • Allow multiple joiners/leavers.
  • Sacrifice performance for consistency.


The implementation of rehashing in 4.2.1 supported non-blocking rehashing, which allowed state providers to continue their work while sending state to a joiner. This proved to be very fragile, however, so it has been removed in 5.0.


Moreover, with virtual nodes enabled the whole cluster will be involved in any rehash anyway, so we allow the rehash process to block transactions in the entire cluster.


Detecting a change in the cluster

This is the starting point. The DistributionManager on each node registers for ViewChange events with the CacheManager.  This allows the DistributionManager to be notified of topology changes.


The nodes in the cluster can have three roles:

  • Permanent node: a member of the cluster in both the old view and the new view
  • Joiner: a member of the cluster only in the new view
  • Leaver: a member of the cluster only in the old view


During a merge, all nodes act as permanent nodes.
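The three roles follow purely from set membership in the old and new views. A minimal sketch of that classification (the class, enum and method names here are my own, not Infinispan's):

```java
import java.util.List;

public class NodeRoles {
    enum Role { PERMANENT, JOINER, LEAVER }

    // Classify a node by comparing its membership in the old and new views.
    static Role roleOf(String node, List<String> oldView, List<String> newView) {
        boolean inOld = oldView.contains(node);
        boolean inNew = newView.contains(node);
        if (inOld && inNew) return Role.PERMANENT;
        if (inNew) return Role.JOINER;
        if (inOld) return Role.LEAVER;
        throw new IllegalArgumentException(node + " is in neither view");
    }
}
```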

RehashControlCommand

This implementation of the Command interface is used as a form of RPC between the members of the cluster to orchestrate a rehash. To minimise the number of command types needed, a single RehashControlCommand is used for the entire process, distinguished by its Type field - each type indicates a separate phase in the lifecycle of the rehash.


There are three RehashControlCommand types:

  • APPLY_STATE: rebalances state to a node, regardless of whether it is the result of a join, a leave or a merge
  • NODE_PUSH_COMPLETED: signals the coordinator that a node has finished pushing state
  • REHASH_COMPLETED: sent by the coordinator to the entire cluster, signals that the rehash is complete
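The three phases above can be summarised as an enum. Note this is only a sketch of the Type values named in this document; the real command class carries additional state (sender, view id, the cache entries to apply) not shown here.

```java
public class RehashPhases {
    // Phase markers carried by a RehashControlCommand (sketch only).
    public enum Type {
        APPLY_STATE,          // push rebalanced state to its new owner
        NODE_PUSH_COMPLETED,  // node -> coordinator: my state push is done
        REHASH_COMPLETED      // coordinator -> cluster: the rehash is finished
    }
}
```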

Permanent node

  1. Initial state: permanent node A has view V0, consistent hash CH0.
  2. Receive a new view (V1)
  3. Block transactions
  4. Update CH with a new CH (CH1) based on V1
  5. Find state for which we are the "main" owner and send it asynchronously to the new owners in CH1
  6. Wait until all receivers confirm that they received the state
  7. Notify the coordinator that we have finished rebalancing
  8. Wait until the coordinator signals the entire cluster has finished rehashing
  9. Invalidate keys for which we are no longer an owner
  10. Unblock transactions


If A receives a new view during step 8, it means a new rehash is pending and so it skips key invalidation.
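Steps 4-6 and 9 amount to comparing ownership under CH0 and CH1. The following is a deliberately simplified model of that decision: the modulo-based owner function stands in for the real consistent hash, and all names are illustrative rather than Infinispan API.

```java
import java.util.*;

public class RehashPlan {
    // Toy stand-in for the consistent hash: maps a key to one view member.
    static String owner(String key, List<String> view) {
        return view.get(Math.abs(key.hashCode()) % view.size());
    }

    // Keys this node was the "main" owner of under the old view that now
    // belong elsewhere: these are pushed (APPLY_STATE) to their new owner
    // and, once the rehash completes, invalidated locally (step 9).
    static Map<String, String> keysToPush(String self, Set<String> localKeys,
                                          List<String> oldView, List<String> newView) {
        Map<String, String> plan = new HashMap<>();
        for (String key : localKeys) {
            if (owner(key, oldView).equals(self)) {   // we were the main owner
                String newOwner = owner(key, newView);
                if (!newOwner.equals(self))           // ownership has moved
                    plan.put(key, newOwner);
            }
        }
        return plan;
    }
}
```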

Leaver

A leaver can either leave the cluster intentionally, in which case it notifies the coordinator before leaving, or unintentionally, in which case the leave is detected by the FD (failure detection) protocol running on the remaining nodes.


Since the leave can be unintentional, the leaver is not required to do anything during rehashing.

Joiner

The joiner doesn't have an initial view or initial data, so the algorithm is much simpler:

  1. Receive initial view (V0)
  2. Compute initial CH (CH0)
  3. Notify the coordinator that we have finished rebalancing
  4. Wait until the coordinator signals the entire cluster has finished rehashing
  5. Finish cache startup
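Step 4 is a simple gate: cache startup cannot finish until the coordinator's REHASH_COMPLETED signal arrives. One way to model this is with a latch counted down by the message handler; the class and method names below are hypothetical, not Infinispan API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class JoinerStartup {
    private final CountDownLatch rehashCompleted = new CountDownLatch(1);

    // Called by the RPC layer when a REHASH_COMPLETED command is received.
    public void onRehashCompleted() {
        rehashCompleted.countDown();
    }

    // Step 4: block (bounded by a timeout) before finishing cache startup.
    // Returns true if the signal arrived in time.
    public boolean awaitRehash(long timeoutMillis) {
        try {
            return rehashCompleted.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```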

Configuration options

This section sets out the configuration timeouts and other options that control rehashing, and discusses the tradeoffs each one represents.

  • hash.rehashEnabled - enables or disables the entire rehashing algorithm
  • hash.rehashRpcTimeout - timeout for rehash RPCs, longer than the normal RPC timeout
  • l1.onRehash - if enabled, keys for which we are no longer an owner are moved to L1, otherwise they are deleted
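In the XML configuration these options sit on the hash and l1 elements of a distributed cache. A sketch of what this looks like (the timeout value here is illustrative, not a recommendation):

```xml
<namedCache name="distributedCache">
   <clustering mode="distribution">
      <!-- rehashRpcTimeout is in milliseconds and should exceed the normal RPC timeout -->
      <hash rehashEnabled="true" rehashRpcTimeout="600000"/>
      <!-- onRehash: keep no-longer-owned keys in L1 instead of deleting them -->
      <l1 enabled="true" onRehash="true"/>
   </clustering>
</namedCache>
```

Enabling l1.onRehash trades memory for latency: keys that would otherwise be invalidated remain readable locally until their L1 lifespan expires.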


Concurrent transactions

The rehashing implementation in version 4.x supported concurrent transactions. The new implementation in 5.0 no longer supports concurrent transactions in order to make the algorithm more reliable.

  • On a joiner, we don't do anything to prevent transactions, but no client code can get a cache reference until the join is complete.
  • On a permanent node, we block all writes in the locking interceptor (DistLockingInterceptor). This means writes will not reach the joiner either.
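The write-blocking behaviour on a permanent node can be pictured as a read-write gate: ordinary writes share the read side, while the rehash takes the write side, stalling new writes and waiting for in-flight ones to drain. This sketch mirrors the idea only; it is not the DistLockingInterceptor code.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteBlockingGate {
    private final ReadWriteLock gate = new ReentrantReadWriteLock();

    public void startRehash()  { gate.writeLock().lock(); }   // block all writes
    public void finishRehash() { gate.writeLock().unlock(); } // unblock writes

    // Every write command passes through the shared side of the gate, so a
    // rehash in progress holds it back until finishRehash() is called.
    public void performWrite(Runnable write) {
        gate.readLock().lock();
        try {
            write.run();
        } finally {
            gate.readLock().unlock();
        }
    }
}
```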


Retry queue

The logic behind the retry queue is that in the inbound invocation handler, invocations may block until the cache instance is ready to deal with requests.  Previously, this was handled by providing callers with an appropriate response type and forcing them to replay the call.  This approach is clunky when distribution is involved, so queueing on the receiver makes more sense.


The InboundInvocationHandler may block for one of three reasons:

  • The (replicated state transfer) processing lock cannot be acquired.  This means a transaction log is being drained.
  • The dist manager hasn't finished joining.
  • The cache hasn't started.


The idea here is that the call will block for a while; if it still cannot proceed after that, it is placed on a queue and the originator sees a RequestIgnoredResponse, which is a SuccessfulResponse.  The InboundInvocationHandler maintains one queue per named cache, and each queue has a dedicated retry thread.  New calls are placed on the queue, and even once the cache is ready to accept calls, the queue is drained before new calls are allowed to proceed.


Two exceptions exist to the above. 

  • If the call is a RehashControlCommand, it is allowed to proceed immediately.
  • If the call is a ClusteredGetCall and the wait times out, the clustered get isn't enqueued for retry, since the caller will already have seen a RequestIgnoredResponse.
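A minimal model of the queue-and-dedicated-thread mechanism described above: commands that arrive before the cache is ready are parked on a per-cache queue, and a dedicated thread replays them, in arrival order, once the cache becomes ready. Class and method names are illustrative, not the actual InboundInvocationHandler API, and the polling loop stands in for proper signalling.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RetryQueue {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean cacheReady = false;

    // Dedicated retry thread for this named cache's queue.
    private final Thread retryThread = new Thread(() -> {
        try {
            while (true) {
                Runnable command = queue.take();          // park until work arrives
                while (!cacheReady) Thread.sleep(10);     // wait for the cache
                command.run();                            // replay in arrival order
            }
        } catch (InterruptedException e) { /* shutting down */ }
    });

    public RetryQueue() {
        retryThread.setDaemon(true);
        retryThread.start();
    }

    // The originator has already received a RequestIgnoredResponse; the
    // command is replayed locally instead of being resent by the caller.
    public void enqueue(Runnable command) { queue.offer(command); }

    public void cacheReady() { cacheReady = true; }
}
```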