Possible locking improvements

Note: Infinispan does not acquire locks for reads, so by "lock" I always mean a write lock, i.e. the lock that is acquired when data is written to the cluster.

     1.

During a transaction, locks are acquired locally before commit time (i.e. as each write is performed).

E.g. two transactions running on the same node:

Tx1: writes on a -> b -> c

Tx2: writes on a -> c -> d

  • these two transactions will execute in sequence, as after Tx1 locks "a" Tx2 won't be able to make any progress until Tx1 is finished.
  • remote transactions that need to acquire a remote lock on "a" (for prepare) won't be able to make any progress until Tx1 (and potentially Tx2) has completed.

 

Suggested optimization: do not acquire any locks until prepare time; use the same behaviour as when reading the keys. This would reduce the locks' scope and give better throughput by allowing more parallelism between transactions.
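A minimal, node-local sketch of the idea (the class and the LOCKS map below are hypothetical stand-ins, not Infinispan's actual lock manager API): writes are buffered in the transaction context and locks are only taken during prepare, so Tx1 and Tx2 from the example above contend on "a" only during the short prepare window.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the node-local lock manager, for illustration only.
public class DeferredLockingTx {

    private static final Map<String, Lock> LOCKS = new ConcurrentHashMap<>();

    // writes are buffered in the transaction context instead of locking eagerly
    private final Map<String, String> writeBuffer = new LinkedHashMap<>();
    private final List<Lock> acquired = new ArrayList<>();

    public void write(String key, String value) {
        writeBuffer.put(key, value);            // no lock is taken here
    }

    // locks are acquired only at prepare time, shrinking their scope
    public boolean prepare(long timeout, TimeUnit unit) throws InterruptedException {
        for (String key : writeBuffer.keySet()) {
            Lock lock = LOCKS.computeIfAbsent(key, k -> new ReentrantLock());
            if (!lock.tryLock(timeout, unit)) {
                release();                      // caller should roll back
                return false;
            }
            acquired.add(lock);
        }
        return true;
    }

    public void commit() {
        // apply writeBuffer to the cache here, then release the locks
        release();
    }

    private void release() {
        acquired.forEach(Lock::unlock);
        acquired.clear();
    }
}

In this model two local transactions writing overlapping key sets can run their operational phase fully in parallel and only serialise on the prepare calls.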

 

Another current problem is that the locking behaviour might affect "only" the local node if it happens not to be an owner, or it could block all other nodes wanting the lock if it happens to be an owner. So the behaviour differs depending on whether the node is an owner of the involved keys or not.

     2.

Non-transactional writes from two nodes on the same key (the transactional situation is discussed further below).

 

 

[Diagram: 1.png]

 

 

Let's say we have key "a" such that consistentHash("a") = {N3, N4}. Two threads write to key "a" on nodes N1 and N2 at the same time. With the right timing the RPCs can take place in the order indicated in the above diagram. RPCs 3 and 4 deadlock in this case, and the resulting delay affects not only this key but also any other locks the two writers might already have acquired.

 

Possible optimization: 

 

 

[Diagram: 2.png]

 

Disregarding numOwners, always go to the main owner first (in this case N3). Acquire the lock on it and delegate to it the job of multicasting to the remaining owners (see the sketch after the list below):

  • this makes sure that there won't be deadlocks when multiple nodes write to the same key
  • user-perceived latency is affected as RPCs 1 and 2 are now executed in sequence
  • the optimization pays off under high contention on the same key
  • it is also valid for situations where async replication is used, as it offers better overall consistency and potentially better throughput by avoiding deadlocks
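A minimal sketch of the forwarding scheme (the ConsistentHash and Node types below are simplified stand-ins, not Infinispan's interfaces): the originator only contacts the main owner, which serialises lock acquisition for the key and fans the value out to the backups itself, so two concurrent writers can never lock the owners in opposite orders.

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// ConsistentHash and Node are simplified stand-ins, not Infinispan interfaces.
public class PrimaryForwardingWrite {

    interface ConsistentHash {
        List<Node> locateOwners(String key);    // owners.get(0) is the main owner
    }

    static class Node {
        final String name;
        final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
        final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();

        Node(String name) { this.name = name; }

        // runs on the main owner: lock, apply, then replicate to the backups
        void writeAsPrimary(String key, String value, List<Node> backups) {
            ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
            lock.lock();
            try {
                data.put(key, value);
                for (Node backup : backups) {
                    backup.data.put(key, value); // stands in for the RPC to the backups
                }
            } finally {
                lock.unlock();
            }
        }
    }

    // entry point used by any node in the cluster (N1 and N2 in the diagram)
    static void write(ConsistentHash ch, String key, String value) {
        List<Node> owners = ch.locateOwners(key);
        Node primary = owners.get(0);           // N3 for key "a"
        primary.writeAsPrimary(key, value, owners.subList(1, owners.size()));
    }
}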

     3.

This one is about another approach to deadlock avoidance. The optimization targets distributed caches without eager locking (the default configuration).

 

Tx1: writes on “a” then “b”

Tx2: writes on “b” then “a”

With some "right" timing this results in a deadlock at prepare time.

 

Suggested optimization:

  • during prepare, for each transaction, order the keys based on their consistent hash value
  • acquire the locks in this sequence
  • this means that locks are NOT acquired in the order in which the corresponding operations happened
  • this ensures that there won't be a deadlock between the two transactions

Notes:

  • it might not work for keys that hash to the same value. For these we can fall back to the key's native hash code (valid if Object.hashCode is overridden). Optionally we can expose a pluggable user callback to determine the ordering (see the sketch below).
  • this can be extended to replicated and local caches as well.
  • it is simpler and more efficient than the current DLD mechanism as it doesn't force one TX to roll back.
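A minimal sketch of the ordering step (consistentHashPosition is a hypothetical function returning the key's position in the consistent hash; it is not an existing Infinispan API):

import java.util.Comparator;
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.stream.Collectors;

// Sketch only: sort the keys written by a transaction into a global order
// before acquiring their locks at prepare time.
public final class LockOrdering {

    static List<Object> lockOrder(List<Object> writtenKeys,
                                  ToIntFunction<Object> consistentHashPosition) {
        return writtenKeys.stream()
                .sorted(Comparator.comparingInt(consistentHashPosition)
                        // tie-breaker for keys that hash to the same value;
                        // relies on Object.hashCode being overridden, or could
                        // be replaced by a user-provided callback
                        .thenComparingInt(Object::hashCode))
                .collect(Collectors.toList());
    }
}

With this in place, Tx1 (writes "a" then "b") and Tx2 (writes "b" then "a") both acquire their prepare-time locks in the same global order, so neither can end up waiting on the other in a cycle.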

     4.

This is similar to 2 but extended to transactions. Let's say we have Tx1 on node N1 writing to {a,b,c} and Tx2 on N2 writing to {d,e,c}. Also consistentHash(c) = {N3, N4}.

 

[Diagram: 3.png]

If the prepare RPCs happen in the order shown in the above diagram then:

  • Tx1 has the lock on c@N3
  • Tx2 has the lock on c@N4
  • RPCs 3 and 4 block and Tx1 and Tx2 are in a deadlock
  • this is not handled by the current DLD mechanism as, e.g., on node N4 there's no way to determine who holds the lock on N3 without querying N3 (an RPC)
  • what happens?
    • locks on {a,b,c,d,e} are held for (potentially) lockAcquisitionTimeout, which defaults to 10 seconds
    • at least one of Tx1/Tx2 won't succeed - potentially both!
    • bad for throughput!

 

Suggested optimization:

  • only acquire the lock on the main data owner: mainDataOwner(key) = consistentHash(key).get(0). In the above example the main data owner for "c" is N3
  • multicast the prepare message to the other backups as well, just don't acquire locks there (see the sketch below)
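A minimal sketch of the prepare handling under this scheme (the ConsistentHash and LockManager interfaces below are simplified stand-ins for illustration, not Infinispan's actual APIs):

import java.util.List;

// ConsistentHash and LockManager are simplified stand-ins, not Infinispan APIs.
public class PrepareOnOwners {

    interface ConsistentHash {
        List<String> locateOwners(Object key);  // owners.get(0) is the main data owner
    }

    interface LockManager {
        void lock(Object key, Object txId);
    }

    static void onPrepare(Object key, Object txId, String localNode,
                          ConsistentHash ch, LockManager lockManager) {
        String mainDataOwner = ch.locateOwners(key).get(0);
        if (localNode.equals(mainDataOwner)) {
            lockManager.lock(key, txId);        // the lock is acquired only here
        }
        // backups still receive the prepare and register the tx state,
        // but they do not acquire any lock for the key
    }
}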

Given the example, what would be the outcome with the optimization in place?

Tx1 won't block on Tx2 when acquiring the lock on N4 (RPC 3 in the above diagram). It will be able to complete and, after releasing its locks, allow Tx2 to complete as well.

What happens when N3 fails after prepare but before receiving the commit message? At this point Tx1 is prepared but doesn't hold any cluster-wide lock - what should be done? One way of solving the situation is to mark the transaction for rollback. Here is a potentially better approach though:

  • N4 is the new main data owner for "c". It has both Tx1's and Tx2's prepare state on it
  • N4 knows that "c" was locked on the main owner, but doesn't know which of Tx1 and Tx2 was the lock owner.
  • when the topology change notification is received on N4, it locks "c" on behalf of both Tx1 and Tx2 (a synthetic lock, sketched below). This makes sure that another tx won't be able to acquire the lock on "c"
  • N1 will send the commit message to N4 and a new backup (N5). N4 receives the commit and "passes" the synthetic lock to Tx1, which completes. The lock is then released and Tx2 can progress.
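A minimal sketch of the synthetic-lock bookkeeping on the new main owner (class and method names are hypothetical, for illustration only):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping on the node that becomes the new main owner (N4).
public class SyntheticLockFailover {

    // keys locked "synthetically", mapped to the transactions that prepared them
    private final ConcurrentHashMap<Object, Set<Object>> syntheticLocks = new ConcurrentHashMap<>();

    // called from the topology-change listener when this node becomes main owner of the key
    void onBecomingMainOwner(Object key, Set<Object> preparedTxsForKey) {
        syntheticLocks.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet())
                      .addAll(preparedTxsForKey);
    }

    // other transactions must not acquire the lock while the synthetic lock is in place
    boolean canLock(Object key) {
        return !syntheticLocks.containsKey(key);
    }

    // the commit from the real lock owner (Tx1 in the example) resolves the ownership:
    // that tx completes, the synthetic lock is dropped and the other tx can progress
    void onCommit(Object key, Object txId) {
        Set<Object> preparedTxs = syntheticLocks.get(key);
        if (preparedTxs != null && preparedTxs.remove(txId)) {
            syntheticLocks.remove(key);
        }
    }
}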

     5. Replicated Keys & Values, non-replicated Locks


Currently each key owner has a copy of the value, both to serve it to requestors and as a backup for the other owners. Locking the key implies that the lock command is sent to, and must succeed on, each owner. The idea here is to have values fail over as they currently do, but to always choose a single lock owner, and to be able to infer the lock state from the other nodes in case this single lock owner were to crash. Let's call this node the lock coordinator of a specific key.

 

There are different ideas to reliably infer the lock state from the other nodes in case the current lock coordinator leaves the cluster:

 

  1. the newly elected lock coordinator doesn't concede the lock to anybody until it has received an "it's ok for me" message from every other node.
  2. the new lock coordinator doesn't concede the lock for some time (configurable, typically very short), during which it expects a message from the current lock owner to restore the state of the lock (sketched below). If the current lock owner doesn't show up, the lock is considered free and will be assigned to one of the waiters; if an owning node shows up too late it will have to roll back its running transactions.
  3. the new lock coordinator might give the lock away to the first node asking for it, assuming it's free; it might still receive an "I had it!" message from some node, in which case the lock state is restored, but if in the meantime the lock was already granted to someone else, the announcing node will have to roll back. This is equivalent to solution 2 with the delay set to zero.
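A minimal sketch of option 2's grace period on the newly elected coordinator (class and method names are hypothetical; a real implementation would also record which waiter the lock is finally granted to):

import java.util.concurrent.TimeUnit;

// Hypothetical per-key state kept by the newly elected lock coordinator.
public class NewLockCoordinator {

    private final long gracePeriodMillis;
    private final long electedAt = System.currentTimeMillis();
    private volatile Object restoredOwner;     // set if the previous owner shows up in time

    NewLockCoordinator(long gracePeriod, TimeUnit unit) {
        this.gracePeriodMillis = unit.toMillis(gracePeriod);
    }

    // message from a node claiming it already held the lock before the failover
    void onOwnerAnnouncement(Object owner) {
        if (System.currentTimeMillis() - electedAt <= gracePeriodMillis) {
            restoredOwner = owner;             // lock state restored
        }
        // if the announcement arrives too late, that node must roll back its transactions
    }

    // a waiter asking for the lock
    boolean tryGrant(Object requester) {
        if (System.currentTimeMillis() - electedAt < gracePeriodMillis) {
            return false;                      // still inside the grace period, keep waiting
        }
        if (restoredOwner != null) {
            return restoredOwner.equals(requester); // the previous owner kept the lock
        }
        return true;                           // lock considered free, grant it to the requester
    }
}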


And this brings several benefits:

  • A single lock operation per key means that deadlocks are less likely than they are now - totally avoided if combined with lock ordering as in the other proposals.
  • Less network traffic and fewer network delays during lock acquisition.
  • Each node will receive fewer commands to perform (it will have to handle locking only for the keys it owns as primary, instead of all the keys it owns) - so it scales better with higher numOwners.
  • There's a single coordinator for each lock - simpler.
  • If the lock owner crashes, and the primary key owner as well, the lock information is lost and not recoverable. This is actually a good thing, as the owning transaction is dead too, so other processes can acquire the lock without waiting for the dead one to time out.

 

Possible problems:

  • It must be considered that both transactional and non-transactional processes might use eager locks to "make sure they'll get the lock", but this might in fact turn out not to hold, as the eager lock might be considered invalid later on. I don't think this is a big problem as
    • it's quite unlikely - less likely than a real database failure, in which case you'd have to deal with it anyway
    • you must be prepared to deal with it anyway, as it could happen in other cases as well - e.g. if your transaction times out