3 Replies Latest reply on May 11, 2012 11:33 AM by galder.zamarreno

Clustered cache recovery after network failure

dudalov May 1, 2012 8:07 PM

Could you please direct me to the docs that describe best practices for re-joining a cluster after a network failure? I found a good discussion at https://community.jboss.org/message/537786, but it's two years old. Are there any updates? What is the best practice for such cases?

1. Losing a node. I guess that when a cluster breaks apart, each node should proceed on its own, using the cached objects in their current state. But can we be sure that the cache is always consistent across the nodes?

2. Re-joining the cluster. Should all nodes drop their existing cache contents and start from scratch?

Thanks,

Dmitry

1. Re: Clustered cache recovery after network failure

galder.zamarreno May 3, 2012 3:43 AM (in response to dudalov)

I'm not aware of further work having been done wrt state merging.

Dropping the contents upon rejoining is probably the safest option.

We're working on an eventual consistency protocol that should help in cases where different versions of the data are found in the cluster and users need to provide some logic to reconcile them; see https://issues.jboss.org/browse/ISPN-999
2. Re: Clustered cache recovery after network failure

dudalov May 9, 2012 3:06 PM (in response to galder.zamarreno)

Thank you, Galder!

More questions. How exactly can I do it? Should I listen for @Merged and drop the contents of all caches on every merging node after receiving a MergeEvent? Or do I need to do it for some events only, such as when ViewChangedEvent.isMergeView is true? Sorry, it's not clear from the Javadoc.

And to be on the safe side, it should be done on all nodes, right?

What about transactions? Clearing the caches won't break any running transactions, will it?
3. Re: Clustered cache recovery after network failure

galder.zamarreno May 11, 2012 11:33 AM (in response to dudalov)

Listen for @Merged and handle the MergeEvent it delivers. That should work.

Handle it on every node and clear the local cache contents. Do this locally, so that the nodes do not all try to wipe all the data at the same time and interfere with each other.

If you clear it, clear will normally wait for locks, so it won't break existing transactions, at least not those based on data already read.
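
To make the advice above concrete, here is a minimal sketch of the pattern in plain Java. It deliberately uses no Infinispan classes: MergeEvent, Node, onMerge and deliverMerge are hypothetical stand-ins that only model the behaviour. In a real deployment, onMerge would presumably be a method annotated with @Merged on a @Listener class registered with the cache manager, and the clear would be restricted to the local node (for example with Flag.CACHE_MODE_LOCAL); treat that mapping as an assumption to check against the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java model of "clear locally on every node after a merge".
// None of these types are Infinispan classes; they are hypothetical stand-ins.
class MergeClearSketch {

    /** Stand-in for the MergeEvent that an @Merged listener method receives. */
    static final class MergeEvent {
        final List<String> newMembers;

        MergeEvent(List<String> newMembers) {
            this.newMembers = Collections.unmodifiableList(new ArrayList<>(newMembers));
        }
    }

    /** Stand-in for one node: its name plus the entries it holds locally, per cache. */
    static final class Node {
        final String name;
        final Map<String, Map<String, String>> localCaches = new ConcurrentHashMap<>();
        int localClears = 0;

        Node(String name) {
            this.name = name;
        }

        /** Returns this node's local map for the named cache, creating it if needed. */
        Map<String, String> cache(String cacheName) {
            return localCaches.computeIfAbsent(cacheName, k -> new ConcurrentHashMap<>());
        }

        // In Infinispan this would be the @Merged method of a @Listener class.
        void onMerge(MergeEvent event) {
            for (Map<String, String> cache : localCaches.values()) {
                cache.clear(); // touches this node's own entries only
                localClears++;
            }
        }
    }

    /** Simulates the merged view being delivered to every member of the cluster. */
    static void deliverMerge(List<Node> members) {
        List<String> names = new ArrayList<>();
        for (Node n : members) {
            names.add(n.name);
        }
        MergeEvent event = new MergeEvent(names);
        for (Node n : members) {
            n.onMerge(event); // each node reacts on its own; no node clears another
        }
    }

    public static void main(String[] args) {
        Node a = new Node("A");
        Node b = new Node("B");
        // During the partition the two nodes ended up with diverging values.
        a.cache("users").put("42", "version seen by A");
        b.cache("users").put("42", "version seen by B");

        deliverMerge(Arrays.asList(a, b));

        System.out.println("A empty: " + a.cache("users").isEmpty());
        System.out.println("B empty: " + b.cache("users").isEmpty());
    }
}
```

The point of the model is that every node handles the same merge notification by dropping only what it holds itself, so no node issues a cluster-wide clear that could collide with another node's.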