Cross Datacenter Replication - design

This document contains the design suggestions for the cross-datacentre replication functionality in Infinispan (ISPN-1824).

It is assumed that, due to latency, the preferred way of replicating between two datacentres is async. Sync replication should also be supported, but it is not the primary focus.

 

 

Use Cases

The following use cases requiring cross-datacentre replication were identified:

 

1. Hot-standby for geographic failover (master/slave)

E.g. a medical system with critical availability requirements that deploys its data in two sites, e.g. London and Newcastle.

- write to a single site, the other site is only needed if the master is down

- async replication/eventual consistency (most likely)

- access pattern suggestion - chained Hot Rod clients

- state transfer for when a site starts is critical

- possibly smaller hot-standby cluster

- as an alternative use case, the slave can be used for offline data mining or ETL

 

2. Follow the sun

E.g. an international bank that wants to keep the data close to where the active users are: in the morning the LON data center is active, then NYC and then SFO.

- only one site active at a time, e.g. LON in the morning and SFO in the evening

- data between data centers is replicated async

- more than 2 sites might be required

- concurrent writes between sites: it should be possible to gracefully shut down one site by allowing the existing clients to finish their work and directing all new clients to the other site. This might result in concurrent writes between the sites: old clients writing a key in the site that's about to shut down and new clients writing it in the new site. This needs to be handled.

 

3. Geographic data partitioning with failover

E.g. an international company whose clients access their data in a 'local' datacenter: European clients access the LON datacenter, while the NYC and SFO datacenters serve US clients based on their location.

- it should be possible to configure only one backup for each site, e.g. LON, besides serving local clients, also acts as a backup for NYC, which acts as a backup for SFO (LON's backup is SFO).

- data is geographically partitioned: that means that when all sites are up, there won't be concurrent writes *on the same key* happening on two different sites (thanks to Erik Salter for sharing his use case)

- this should also be supported (Erik S): all the sites contain all the data. Each site has a primary and a secondary backup, where the primary holds 2 copies of the data and the secondary holds 1 copy

- another requirement (from Erik S): between-site consistency should not be lost for transactional writes (sync replication).
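
As an illustration of the single-backup arrangement from the first bullet of this use case, the sketch below encodes the LON/NYC/SFO ring as a map and picks the site that serves a client when its home site is down. The class name BackupRing, the method siteServing and the one-hop failover rule are assumptions made for this example; they are not part of Infinispan.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative only: site names come from the text, the routing rule is an assumption. */
final class BackupRing {

    // site -> the site holding its backup copy
    // (LON backs up NYC, NYC backs up SFO, SFO backs up LON)
    static final Map<String, String> BACKUP_OF = Map.of(
            "NYC", "LON",
            "SFO", "NYC",
            "LON", "SFO");

    /**
     * Returns the site that should serve clients whose data lives in homeSite.
     * Only one hop is followed, because each site has exactly one backup.
     */
    static String siteServing(String homeSite, Set<String> sitesUp) {
        if (sitesUp.contains(homeSite)) {
            return homeSite;
        }
        String backup = BACKUP_OF.get(homeSite);
        if (backup != null && sitesUp.contains(backup)) {
            return backup;
        }
        throw new IllegalStateException("neither " + homeSite + " nor its backup is up");
    }

    public static void main(String[] args) {
        // NYC is down, so its clients are served by its backup site.
        System.out.println(siteServing("NYC", Set.of("LON", "SFO")));
    }
}
```

If both a site and its backup are down, the data of that partition is unavailable in this model, which is the trade-off of configuring a single backup per site.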

Suggested solutions

The following approaches are currently considered:

 

1. Virtual Cluster

 

This is an extension of JGroups' RELAY protocol.

x-datasenterrepl-relay2.png

The diagram above shows two datacentres, one in London (LON) and one in San Francisco (SFO). Each datacentre runs 4 Infinispan nodes.

The bridge between the two local clusters is another cluster formed between the coordinators A and X: a separate JGroups channel, most likely configured over TCP.

JGroups/RELAY forms a virtual cluster containing the nodes in both sites, e.g. a potential view would be {X, A, B, Y, Z, D, T, C}.

In order to make sure that data is present in both sites, the TopologyAwareConsistentHash (TACH) is used. All the nodes must have the siteId configured, i.e. A, B, C, D have siteId="LON" and X, Y, Z, T have siteId="SFO".

E.g. when D.put(k,v) happens (numOwners=2), the TACH, based on the siteId config, distributes the data to both sites. consistentHash(k) might be {B, T}. The sequence of RPCs involved in the put operation is:

  • sync parallel RPC from D to B and A
  • A sends (async) an RPC to X, over RELAY
  • X forwards the RPC to T
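
The placement rule described above (one copy per site when numOwners=2) can be sketched as follows. This is a deliberately simplified stand-in, not Infinispan's TopologyAwareConsistentHash: the class SiteAwareOwners, its Node type and the linear walk over the view are hypothetical and only show how a siteId can steer owner selection.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified owner selection; not Infinispan's TACH code. */
final class SiteAwareOwners {

    /** A cluster member tagged with the site it runs in. */
    static final class Node {
        final String name;
        final String siteId;

        Node(String name, String siteId) {
            this.name = name;
            this.siteId = siteId;
        }

        @Override
        public String toString() {
            return name + "@" + siteId;
        }
    }

    /**
     * Walks the view starting at the key's position and picks numOwners nodes,
     * preferring nodes from sites that do not hold a copy yet.
     */
    static List<Node> locateOwners(Object key, List<Node> members, int numOwners) {
        int start = Math.floorMod(key.hashCode(), members.size());
        List<Node> owners = new ArrayList<>();
        Set<String> sitesUsed = new HashSet<>();

        // First pass: at most one owner per site.
        for (int i = 0; i < members.size() && owners.size() < numOwners; i++) {
            Node candidate = members.get((start + i) % members.size());
            if (sitesUsed.add(candidate.siteId)) {
                owners.add(candidate);
            }
        }
        // Second pass: fill any remaining slots with nodes not chosen yet.
        for (int i = 0; i < members.size() && owners.size() < numOwners; i++) {
            Node candidate = members.get((start + i) % members.size());
            if (!owners.contains(candidate)) {
                owners.add(candidate);
            }
        }
        return owners;
    }

    public static void main(String[] args) {
        List<Node> view = List.of(
                new Node("X", "SFO"), new Node("A", "LON"),
                new Node("B", "LON"), new Node("Y", "SFO"),
                new Node("Z", "SFO"), new Node("D", "LON"),
                new Node("T", "SFO"), new Node("C", "LON"));
        System.out.println(locateOwners("k", view, 2));
    }
}
```

With numOwners=2 and two sites in the view, the first pass always ends with one owner in LON and one in SFO, which is the property the virtual cluster design relies on.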

Pros

  • this approach leverages the existing Infinispan state transfer functionality for handling nodes/sites leaving/starting
  • the code is already integrated in Infinispan 4.2.x (not production ready). However the replication between the sites is done sync by default - and that needs to change.

  • RELAY takes care of retransmission in case an async call fails. This is critical for consistency (not implemented yet but will be as part of JGRP-1401)
  • can handle concurrent writes (see Follow the sun/concurrent writes)

Cons

  • when a node joins, the nodes from all the sites are involved in rehashing. This can be further optimized though, to only include the local nodes as state providers
  • the lock owner might be located in the remote cluster, and accessing it would be very costly. As long as we don't support concurrent writes between the sites this can be mitigated by forcing a local lock only - this is intrusive though (see below).
  • intrusive
    • reads should only go to local nodes
    • writes should be configured to send the local messages sync and the remote messages async
    • whilst the prepare/commit sequence must be executed sync on the local cluster, we only send remotely (async) information about the final result of the transaction - so the transaction code requires changes as well
  • constraints on the configuration, e.g. it is not possible to have an active site with numOwners=4 and a backup site with numOwners=2, as numOwners is configured for the entire cluster
  • when shutting down one site, the other one would be involved in the rehash. This would affect the running site's performance and is not desirable for certain use cases. Erik Salter raised this concern for his use case: gracefully shutting down one site, i.e. existing clients stay connected while new ones are migrated to another site.

2. HotRod-based

 

In this design the two sites form disjoint clusters that communicate with each other over Hot Rod (async). Each node acts both as an HR client for the other site and as an HR server to which nodes in the other site connect.

Pros

  • state transfer between clusters is supported through the use of RemoteCacheStore, which runs over Hot Rod
  • the number of network roundtrips is reduced as we don't have designated nodes in the middle to route the calls, but the clients connect to the remote data owner directly (using HR's smart routing)

Cons

  • perhaps the most important con is that the async HR calls are not reliable: if a client crashes before the async queue is flushed, the information is lost for good and the sites are out of sync. Retransmission logic needs to be added to the HR clients for this.
  • there is a mesh of connections between sites. This might cause an administrative burden, e.g. adding a new node to a site might require opening a firewall port. In certain environments this can take a lot of time.
  • the data that is written through HR can only be read through HR, e.g. after hrClient.put(k,v), nodeA.get(k) would return null, as the HR server encodes both the key and the value in a specialized structure. This can be overcome with a server-side interceptor.

 

3. Custom bridge

 

In this scenario the two sites form disjoint clusters, as in 2. The communication between the clusters is done by a custom bridge: similar to the RELAY link, but without building a virtual view over the two clusters. Considering the diagram at 1:

  • on each site an end-point is designated to handle the replication between the sites: A on LON and X on SFO
  • D.put(k,v) with consistentHash(k) = {B, C} is handled as follows
    • the put is sent in parallel to the owners (B and C) and to the end-point (A)
    • A bridges the put command and sends it over to X (async)
    • X gets the command and replays it locally
    • in the case of transactions, it would be the commit message that triggers the sending of transaction information to the remote site
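
The non-transactional put flow above can be modelled with a small sketch. Everything in it is hypothetical (the BridgedSite class, the in-memory map standing in for the local owners, and the single-threaded executor standing in for the end-point); it only illustrates the idea of a local write followed by an async hand-off that the remote end-point replays without bridging it back.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical model of one site; the names are illustrative, not Infinispan API. */
final class BridgedSite {
    final String name;
    // Stands in for the data held by the local owners (e.g. B and C).
    final Map<String, String> data = new ConcurrentHashMap<>();
    // Stands in for the end-point (A or X); one daemon thread keeps the forwarding order.
    private final ExecutorService bridge = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });
    private BridgedSite remote;

    BridgedSite(String name) {
        this.name = name;
    }

    void linkTo(BridgedSite other) {
        this.remote = other;
    }

    /** A put issued by a local node: applied locally, then forwarded async by the end-point. */
    CompletableFuture<Void> put(String key, String value) {
        data.put(key, value); // write to the local owners
        return CompletableFuture.runAsync(() -> remote.replay(key, value), bridge);
    }

    /** Runs on the receiving side: replays the command locally and does not bridge it back. */
    void replay(String key, String value) {
        data.put(key, value);
    }

    public static void main(String[] args) {
        BridgedSite lon = new BridgedSite("LON");
        BridgedSite sfo = new BridgedSite("SFO");
        lon.linkTo(sfo);
        sfo.linkTo(lon);
        lon.put("k", "v").join(); // join only to make the demo deterministic
        System.out.println(sfo.name + " has k=" + sfo.data.get("k"));
    }
}
```

In this model two sites writing the same key at the same time would simply overwrite each other in arrival order, which mirrors the limitation on concurrent writes of this approach.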

Pros

  • a subset of RELAY can be used for bridging, including the retransmission logic
  • less intrusive than 1; also supports heterogeneous sites
  • doesn't have the connection mesh from 2, at the cost of some additional RPCs

 

Cons

  • the state transfer needs to be customized to send the whole cluster's state through the end-point. We also need to make sure that the end-point doesn't get overloaded during state transfer.
  • can't handle concurrent writes (see the "Follow the sun/concurrent writes" section).