Transaction recovery design

Context

     Infinispan's XAResource implementation ignores the XAResource.recover() and XAResource.forget() methods. As a consequence, Infinispan transactions are vulnerable to inconsistencies (aCid) when a failure happens between prepare and commit.

     E.g. consider a distributed transaction in which a user transfers money from an account stored in the database to an account stored in Infinispan. When TM.commit() is invoked, both resources prepare successfully (1st phase). During commit (2nd phase), the database successfully applies the changes whilst the Infinispan node fails before receiving the commit request from the TM (this exact scenario is handled in the 2nd diagram). At this point the system is in an inconsistent state. Recovery deals with this situation (possibly requiring manual intervention) to make sure data ends up in a consistent state.

 

 

Design

 

The following sequence diagrams show the intended design for implementing the recovery functionality.

 

Happy flow

1_happy_flow.png

1-Happy Flow

Notes:

  • Step 3.1.1.2 - after preparing, we hold a reference to the transaction's Xid on all the backup nodes. This is proof that the transaction was prepared successfully, and it is needed in case the transaction originator (Node1) fails before commit (see next scenario)
  • Step 3.2.2 - if the transaction is successfully completed (that includes rollbacks) then this reference is forgotten.
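
The bookkeeping described in these two notes could be sketched roughly as follows. This is only an illustration under stated assumptions, not Infinispan's actual code: PreparedTxRegistry, XidKey and all their methods are hypothetical names introduced here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical immutable key wrapping the three parts of an XA transaction id. */
final class XidKey {
    private final int formatId;
    private final byte[] globalTxId;
    private final byte[] branchQualifier;

    XidKey(int formatId, byte[] globalTxId, byte[] branchQualifier) {
        this.formatId = formatId;
        this.globalTxId = globalTxId.clone();
        this.branchQualifier = branchQualifier.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof XidKey)) return false;
        XidKey other = (XidKey) o;
        return formatId == other.formatId
                && Arrays.equals(globalTxId, other.globalTxId)
                && Arrays.equals(branchQualifier, other.branchQualifier);
    }

    @Override
    public int hashCode() {
        return 31 * (31 * formatId + Arrays.hashCode(globalTxId))
                + Arrays.hashCode(branchQualifier);
    }
}

/** Hypothetical per-node registry of prepared but not yet completed transactions. */
final class PreparedTxRegistry {
    private final Set<XidKey> prepared = ConcurrentHashMap.newKeySet();

    /** Step 3.1.1.2: remember the Xid once prepare has succeeded on this node. */
    void onPrepared(XidKey xid) {
        prepared.add(xid);
    }

    /** Step 3.2.2: drop the reference once the tx has committed or rolled back. */
    void onCompleted(XidKey xid) {
        prepared.remove(xid);
    }

    /** Copy of the Xids this node would report to a recovery scan. */
    Set<XidKey> preparedXids() {
        return new HashSet<>(prepared);
    }
}
```

The key compares the byte arrays by content, so a Xid received again from another node or from the TM matches the one stored at prepare time.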

 

Originator fails

 

2_originator_fails.png

2-Originator fails

Notes:

  • When the node is restarted the recovery process is triggered. This calls XAResource.recover in step 4.1.
  • Because the starting node has no idea where the transactions that originated on it were replicated, 4.1.1 needs to be a broadcast
  • This call is not only performed at startup; it is also an async periodic process (in JBossTM by default every 2 mins)
  • Optimisation: the broadcast can be replaced by nodes proactively sending (unicast) their known lists of prepared transactions to the originator. Out of scope for the first version.
  • Step 4.2 - the TM, comparing the prepared transactions returned by recover with its own transaction records, determines that 2PC was not successful and notifies the admin
  • At this moment (i.e. after a node failure) all the transaction's changes are in the system (assuming numOwners >= 2). The administrator should be able to either roll back or commit the transaction (JMX hooks, TBD).
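
Steps 4.1.1 and 4.2 can be illustrated with a small sketch. It rests on several assumptions: Xids are shown as plain strings for brevity, RecoveryScanSketch and its methods are hypothetical, and the TM is assumed to keep a record of transactions it has not finished on all resources (the real comparison is internal to the TransactionManager).

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the recovery scan; not the TM's or Infinispan's real logic. */
final class RecoveryScanSketch {

    /**
     * Step 4.1.1: merge the prepared-Xid lists returned by every node that
     * answered the broadcast. A tx prepared on several nodes appears once.
     */
    static Set<String> mergeResponses(Map<String, Set<String>> preparedByNode) {
        Set<String> merged = new TreeSet<>();
        for (Set<String> xids : preparedByNode.values()) {
            merged.addAll(xids);
        }
        return merged;
    }

    /**
     * Step 4.2 (assumed behaviour): a tx is in doubt when the cluster still
     * holds it as prepared and the TM has it recorded as unfinished.
     */
    static Set<String> inDoubt(Set<String> clusterPrepared, Set<String> unfinishedInTmLog) {
        Set<String> result = new TreeSet<>(clusterPrepared);
        result.retainAll(unfinishedInTmLog);
        return result;
    }
}
```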

 

Another node (not originator) fails

3_non_originator_failure.png

3-Non-originator fails

 

Replicating recovery information

For distributed clusters the number of nodes to which the prepare RPC is multicast is variable, depending both on the number of keys written by the transaction (txSize) and on numOwners. E.g. with numOwners=2, a transaction that writes 2 keys (i.e. txSize=2) would multicast a prepare to at most 4 nodes and at least 2 nodes. In general, the number of nodes to which prepare is sent (noPrepares) satisfies the following condition: txSize * numOwners >= noPrepares >= numOwners. All the nodes that receive the prepare request would hold this information (future releases, depending on community demand, might allow a configurable number of backups for this recovery info).
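
As a quick check of the bounds above, the following sketch simply evaluates the stated condition. PrepareFanOut and its methods are hypothetical helper names, and the sketch does not model the additional cap imposed by the number of nodes in the cluster.

```java
/** Hypothetical helper that evaluates the noPrepares bounds stated in the text. */
final class PrepareFanOut {

    /** Lower bound: all written keys happen to share the same numOwners nodes. */
    static int minPrepares(int numOwners) {
        return numOwners;
    }

    /** Upper bound: every written key is owned by a disjoint set of nodes. */
    static int maxPrepares(int txSize, int numOwners) {
        return txSize * numOwners;
    }

    /** True when txSize * numOwners >= noPrepares >= numOwners. */
    static boolean withinBounds(int noPrepares, int txSize, int numOwners) {
        return noPrepares >= minPrepares(numOwners)
                && noPrepares <= maxPrepares(txSize, numOwners);
    }
}
```

For the example in the text (numOwners=2, txSize=2), minPrepares returns 2 and maxPrepares returns 4.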

 

Cleanup registered tx from RecoveryManager

The following mechanisms make sure that transactions registered with the RecoveryManager are cleaned up:

  1. The RecoveryManager stores the recovery related information in a local cache instance. This cache instance is configured by the user (see the XML Configuration section). This offers the following features out of the box:
    1. expiration can be configured
    2. passivation can be configured, which is particularly important in order to avoid OOM
  2. On stable clusters they are removed (async) during commit (step 3.2.2 on 1-Happy Flow)
  3. If a node fails before commit, then the tx is removed when the node restarts and the TM calls XAResource.forget (step 4.3 on 2-Originator fails)
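
To make the interplay between explicit removal and expiration concrete, here is a minimal sketch. In the design the expiration is delegated to the configured cache; ExpiringRecoveryInfo below is a hypothetical stand-in that only mimics that behaviour, with Xids shown as strings and time passed in explicitly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the local cache holding recovery information. */
final class ExpiringRecoveryInfo {
    private final Map<String, Long> preparedAtMillis = new ConcurrentHashMap<>();
    private final long maxAgeMillis;

    ExpiringRecoveryInfo(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Called when a tx is prepared; records when the entry was created. */
    void register(String xid, long nowMillis) {
        preparedAtMillis.put(xid, nowMillis);
    }

    /** Explicit removal: commit on a stable cluster, or XAResource.forget. */
    void remove(String xid) {
        preparedAtMillis.remove(xid);
    }

    /**
     * Expiration: drops entries older than maxAgeMillis and returns how many
     * were dropped (the count is only exact without concurrent writers).
     */
    int purge(long nowMillis) {
        int before = preparedAtMillis.size();
        preparedAtMillis.values().removeIf(t -> nowMillis - t > maxAgeMillis);
        return before - preparedAtMillis.size();
    }

    int size() {
        return preparedAtMillis.size();
    }
}
```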

 

New Classes

The class diagram below highlights the new classes added to the system.

  static_view.png

4-New Classes

The key element in the above diagram is the RecoveryManager:

  1. Its state is composed of all the prepared and not yet committed transactions in the system. Memory-wise, this is the whole new state added to the system
  2. getClusterPreparedTransactions broadcasts a GetPreparedTxCommand, which returns the list of prepared Xids from each node. It is called by XAResource.recover()
  3. forgetRemoteTransactions async multicasts ForgetTxCommand
  4. purgePreparedTransactions is called periodically to clean up the transactions for which the originating TransactionManager was never resurrected (see previous section on Cleanup registered...)
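
The responsibilities listed above might translate into a contract along these lines. The method names come from this design; the parameter and return types are assumptions, and RecoveryManagerSketch and RecoverAdapter are hypothetical names chosen to avoid suggesting this is the final API.

```java
import java.util.List;
import javax.transaction.xa.Xid;

/** Sketch of the RecoveryManager contract; signatures are assumed, not final. */
interface RecoveryManagerSketch {

    /** Broadcasts GetPreparedTxCommand and gathers the prepared Xids of all nodes. */
    List<Xid> getClusterPreparedTransactions();

    /** Asynchronously multicasts ForgetTxCommand for the given transaction. */
    void forgetRemoteTransactions(Xid xid);

    /** Periodic cleanup of entries whose originating TM never came back. */
    void purgePreparedTransactions();
}

/** Shows how XAResource.recover() could delegate to the manager. */
final class RecoverAdapter {

    /** XAResource.recover returns Xid[], so the gathered list is copied into an array. */
    static Xid[] recover(RecoveryManagerSketch recoveryManager) {
        List<Xid> xids = recoveryManager.getClusterPreparedTransactions();
        return xids.toArray(new Xid[0]);
    }
}
```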

XML configuration

Recovery is configured per cache, within the transaction xml tag.

<transaction>
      <recovery enabled="true" recoveryInfoCacheName="recoveryCache"/>
</transaction>

Observations:

  1. recovery is a new tag, nested within transaction
  2. if enabled=false then no recovery information is maintained and XAResource.recover is ignored (warning logged on call). Defaults to true.
  3. recoveryInfoCacheName - the name of the LOCAL cache where the recovery information is stored. If not specified, defaults to a local cache that would expire entries older than 6 hours.

JMX tooling for handling in doubt transactions

When the TransactionManager determines that a tx is in-doubt it informs the System Administrator (SA) about it. At this point the SA only knows the Xid (i.e. byte array) of the in-doubt transaction and needs to take action to complete it by either committing it or rolling it back. The following operations are exposed through JMX in order to support the recovery process:

 

showInDoubtTransactions(): (xid:Xid, internalId:Long, status:String)

This returns all the transactions that are in-doubt. The transactions are presented to the SA as a tuple composed of:

  • Xid - a string representation of the transaction's id as received from the TransactionManager
  • internalId - a system-generated id, unique per Xid. The SA uses this value for completing the transaction (see next operations). The SA has to visually map the Xid of the transaction, as received from the TransactionManager's recovery process, to this value. This Xid->long mapping is needed because passing the Xid, which is basically a byte[], to the JMX operations is more complicated.
  • status - a string describing the state of the transaction, e.g. "prepared" or "heuristically committed" (the former indicates that the tx is prepared on all nodes but commit hasn't been received; the latter means that the transaction was committed on a subset of the nodes, but not on all of them).
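
The Xid->long mapping described above could be kept in a small table such as the one sketched here. InternalIdTable, the string format of the Xid and the way ids are generated are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical table assigning a short numeric id to each in-doubt Xid. */
final class InternalIdTable {
    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<String, Long> idByXid = new ConcurrentHashMap<>();
    private final Map<Long, String> xidById = new ConcurrentHashMap<>();

    /** Assumed display format: formatId, then both byte arrays as hex. */
    static String xidToString(int formatId, byte[] globalTxId, byte[] branchQualifier) {
        return formatId + ":" + hex(globalTxId) + ":" + hex(branchQualifier);
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Returns the existing id for this Xid, or assigns the next free one. */
    long internalIdFor(String xid) {
        return idByXid.computeIfAbsent(xid, key -> {
            long id = nextId.getAndIncrement();
            xidById.put(id, key);
            return id;
        });
    }

    /** Reverse lookup used by forceCommit/forceRollback; null if unknown. */
    String xidFor(long internalId) {
        return xidById.get(internalId);
    }
}
```

With such a table the SA only has to type a small number into the JMX console instead of the byte arrays that make up the Xid.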

 

forceRollback(internalId:Long) : String
forceCommit(internalId:Long) : String

These operations receive an internal transaction id (as previously described) and force the transaction's completion by either reapplying the changes or rolling them back. The value returned by this operation is a String informing the user whether the operation was successful or not. If it isn't successful the user needs to re-run it manually.

 

forceRollback(formatId:int, globalTxId:byte[], branchQualifier:byte[]) : String
forceCommit(formatId:int, globalTxId:byte[], branchQualifier:byte[]) : String

These operations have the same semantics as the ones described in the previous section. The difference lies in the fact that in this case the transaction is identified through the Xid directly. Whilst these are not useful for manual intervention, they can be used for triggering an automated process: e.g. the TransactionManager's recovery process, when finding an in-doubt transaction, calls some Java code that invokes these operations through JMX.
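
An automated caller could invoke the Xid-based operation through the standard JMX API roughly as sketched below. Only MBeanServerConnection.invoke and the other javax.management calls are real API; the RecoveryAdmin interface, the ObjectName and the stub MBean are hypothetical and exist only so the call can be tried without a running cache.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.StandardMBean;

final class JmxForceCommitSketch {

    /** Hypothetical management interface mirroring the operation described above. */
    public interface RecoveryAdmin {
        String forceCommit(int formatId, byte[] globalTxId, byte[] branchQualifier);
    }

    /** Invokes forceCommit on the MBean registered under the given name. */
    static String forceCommit(MBeanServerConnection server, ObjectName name,
                              int formatId, byte[] globalTxId, byte[] branchQualifier)
            throws Exception {
        Object[] params = { formatId, globalTxId, branchQualifier };
        // JMX signatures use Class.getName(): "int" for int, "[B" for byte[]
        String[] signature = { "int", "[B", "[B" };
        return (String) server.invoke(name, "forceCommit", params, signature);
    }

    /** Registers a stub MBean on the platform server and calls it once. */
    static String demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Made-up name; the real ObjectName depends on the cache's JMX setup
            ObjectName name = new ObjectName("example.recovery:type=RecoveryAdmin");
            RecoveryAdmin stub = (formatId, gtx, bq) -> "Commit successful (stub)";
            server.registerMBean(new StandardMBean(stub, RecoveryAdmin.class), name);
            try {
                return forceCommit(server, name, 1, new byte[] {1}, new byte[] {2});
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Against a remote node the same forceCommit helper would be given a connection obtained from a JMX connector instead of the platform MBeanServer.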

 

forget(internalId:Long): String
forget(formatId:int, globalTxId:byte[], branchQualifier:byte[]):String

These operations remove from the system the recovery information related to a transaction. Although this is normally a prerogative of the TransactionManager, these operations are added here for completeness.