According to the current documentation (https://docs.jboss.org/author/display/ISPN/Infinispan+Distributed+Execution+Framework#InfinispanDistributedExecutionFramework-Distributedtaskfailoverandmigration), an Infinispan cluster should detect a node failure and migrate a distributed task currently running on that node to the next suitable node.
I have tried simulating this scenario with 5.1.2.FINAL (and I have reasons to suspect 5.1.4.FINAL behaves similarly):
1) a three-node cluster (A-***, B-*** and C-***)
2) a distributed task submitted in parallel on all nodes with submitEverywhere(distributedCallable) and no input keys from A-***
3) killing node B-*** - not the one that initiated the callable - while the task was running on it.
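For reference, the steps above correspond to roughly the following sketch. This is not my exact test code; the cache name and callable are placeholders, and I am assuming the standard 5.1.x distributed execution API (DefaultExecutorService / submitEverywhere):

```java
// Hypothetical sketch of the scenario; "distCache" and MyDistributedCallable
// are placeholders, not the actual names used in my test tool.
Cache<String, String> cache = cacheManager.getCache("distCache");
DistributedExecutorService des = new DefaultExecutorService(cache);

// Submitted from node A; one future per cluster member (A, B, C).
List<Future<String>> futures = des.submitEverywhere(new MyDistributedCallable());

// Node B is killed while its part of the task is still running.
// Expected per the docs: all three get() calls return valid results,
// one of them computed on a backup node instead of B.
// Observed: one get() throws ExecutionException wrapping SuspectedException.
for (Future<String> f : futures) {
   System.out.println("got response " + f.get());
}
```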
The node failure has been detected by the cluster, which performed a view change, but instead of the expected result (three futures that return valid results, albeit one not computed on the node that died, but on a backup node), I have seen:
> got response from C-17379
Exception in thread "main" java.util.concurrent.ExecutionException: org.infinispan.CacheException: SuspectedException
Caused by: org.infinispan.CacheException: SuspectedException
Caused by: SuspectedException
... 11 more
Is my understanding of the distributed task migration mechanism correct, and are my expectations valid?
If no, could you please point me in the right direction? What exactly does "task migration" mean, and what result is expected for the scenario presented above?
If yes, is this a feature that is not implemented yet (as the documentation seems to suggest)?
I have a command line testing tool that makes simulating all these various scenarios easy, and I will be delighted to share it with the dev team if they believe this case is worth investigating and getting to the bottom of.
This thread is related to https://community.jboss.org/message/731545; I just took failure detection out of the picture, since failure detection works fine with proper tuning.
Thanks for the quick reply. I tested it only on one machine, starting several cache instances and forming the cluster from them. When I start my DistributedCallable in one process, it is sometimes migrated to another process, so I believe the cluster is working. Still, when I kill the process where the DistributedCallable is executing, it is not started again on another process. I tried both with RANDOM_NODE_FAILOVER and with my own custom failover policy.
Is this supposed to work as I just described, or am I missing something?
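For clarity, this is roughly how I attach the failover policy. This is a sketch, not my exact code; it assumes a version exposing the DistributedTask builder API, and MyDistributedCallable is a placeholder:

```java
// Hypothetical sketch; MyDistributedCallable is a placeholder name.
DistributedExecutorService des = new DefaultExecutorService(cache);
DistributedTaskBuilder<String> builder =
      des.createDistributedTaskBuilder(new MyDistributedCallable());

// Either the built-in policy or my own DistributedTaskFailoverPolicy.
builder.failoverPolicy(DefaultExecutorService.RANDOM_NODE_FAILOVER);
DistributedTask<String> task = builder.build();

Future<String> future = des.submit(task);
// If the process executing the callable is killed here, I would expect
// the policy to re-submit the task to a surviving node, but the task
// is never restarted on another process.
String result = future.get();
```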
Here are my updated observations. It looks like when the node that actually starts a DistributedTask is killed (i.e. the one waiting on "future.get()"), the task cannot be re-executed, regardless of whether the task was running on that node or on some other node that was killed later on. Did you test this scenario?