9 Replies Latest reply on Nov 26, 2014 9:09 PM by rtom

Distributed Task Failover on Node Failure

ovidiu.feodorov May 1, 2012 3:21 AM

According to the current documentation (https://docs.jboss.org/author/display/ISPN/Infinispan+Distributed+Execution+Framework#InfinispanDistributedExecutionFramework-Distributedtaskfailoverandmigration), an Infinispan cluster should detect a node failure and migrate a distributed task currently running on that node on the next suitable node.

I have tried simulating this scenario with 5.1.2.FINAL (and I have reasons to suspect 5.1.4.FINAL behaves similarly):

1) three node cluster (A-***, B-*** and C-***)

2) a distributed task submitted in parallel on all nodes with submitEverywhere(distributedCallable) and no input keys from A-***

3) killing the node B-*** - not the one that initiated the callable - while the task was running on it.

The node failure has been detected by the cluster, which performed a view change, but instead of the expected result (three futures that return valid results, albeit one not computed on the node that died, but on a backup node), I have seen:

> got response from C-17379

Exception in thread "main" java.util.concurrent.ExecutionException: org.infinispan.CacheException: SuspectedException

at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)

at java.util.concurrent.FutureTask.get(FutureTask.java:83)

at org.infinispan.distexec.DefaultExecutorService$DistributedRunnableFuture.get(DefaultExecutorService.java:557)

at com.novaordis.playground.infinispan.command.LaunchDistributedCallable.execute(LaunchDistributedCallable.java:117)

at com.novaordis.playground.infinispan.Main.readCommandsFromCommandLineAndPassThemToNode(Main.java:78)

at com.novaordis.playground.infinispan.Main.main(Main.java:39)

Caused by: org.infinispan.CacheException: SuspectedException

at org.infinispan.util.Util.rewrapAsCacheException(Util.java:524)

at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:168)

at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:478)

at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)

at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)

at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)

at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)

at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)

at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

at java.util.concurrent.FutureTask.run(FutureTask.java:138)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)

Caused by: SuspectedException

at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:349)

at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:263)

at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:163)

... 11 more

It is my understanding of the distributed task migration mechanism correct, and my expectations valid?

If no, could you please point me to the right direction? What exactly does "task migration" mean and what result is expected for the scenario presented above?

If yes, is this a feature not implemented yet (as the documentation seem to suggest?)

I have a command line testing tool that makes simulating all these various scenarios easy, and I will be delighted to share it with the dev team, if they believe this case is worth investigating and get to the bottom to.

This thread is related to https://community.jboss.org/message/731545, I just took the failure detection out of the picture; failure detection works fine with proper tunning.

Thanks,
Ovidiu

1. Re: Distributed Task Failover on Node Failure

vblagojevic May 1, 2012 10:52 AM (in response to ovidiu.feodorov)

Ovidiu, task failover is not implemented yet! It is planned for 6.0; as soon as failover SPI is done I'll ping you for review!

Regards,
Vladimir
Actions
2. Re: Distributed Task Failover on Node Failure

ovidiu.feodorov May 1, 2012 11:41 AM (in response to vblagojevic)

Thanks, I am staying tuned.
Actions
3. Re: Distributed Task Failover on Node Failure

lazetics Mar 6, 2013 1:32 PM (in response to ovidiu.feodorov)

Hi Vladimir,

Does version 5.2.1.Final support distributed task failover and custom implementations of DistributedTaskFailoverPolicy? In documentations for this version it says that it does, also I see it in the API, still I cannot make it working.

Thanks,
Strahinja
Actions
4. Re: Distributed Task Failover on Node Failure

vblagojevic Mar 6, 2013 7:01 PM (in response to lazetics)

Strahinja,

It should work. We have unit tests that specifically test this feature. What are you observing?

Regards,
Vladimir
Actions
5. Re: Distributed Task Failover on Node Failure

lazetics Mar 7, 2013 8:13 AM (in response to vblagojevic)

Hi Vladimir,

Thanks for the quick reply. I tested it only on one machine, starting several instances of cache and creating the cluster. When I start my DistributedCallable in one process it is sometimes migrated to the another process, so I believe the cluster is working. Still, when I kill the process where the DistributedCallable is executing, it does not start it again on another process. I tried both with RANDOM_NODE_FAILOVER and my custom Failover policy.
Is this supposed to work like I just described or I am missing something?

Thanks,
Strahinja
Actions
6. Re: Distributed Task Failover on Node Failure

vblagojevic Mar 7, 2013 10:28 AM (in response to lazetics)

Strahinja,

It should work, are you sure node is being killed while DistributedCallable is executing? If so, and it still does not work, open a JIRA report with your example code!

Regards,
Vladimir
Actions
7. Re: Distributed Task Failover on Node Failure

lazetics Mar 14, 2013 9:35 AM (in response to vblagojevic)

Hi Vladimir,

Here are my updated observations. It looks like when the node which actually starts a DistributedTask is killed (i.e. the one which waits on "future.get()"), the task cannot be re-executed, whether the task was running on that node or on some other node which was killed later on. Did you test this scenario?

Thanks,
Strahinja
Actions
8. Re: Distributed Task Failover on Node Failure

vblagojevic Mar 14, 2013 7:08 PM (in response to lazetics)

Strahinja,

This is not a supported feature. If master node dies, task is lost. I am not even sure how we would support master task node failover and a task reference recovery in a non-convoluted way.

Regards,
Vladimir
Actions
9. Re: Distributed Task Failover on Node Failure

rtom Nov 26, 2014 9:09 PM (in response to vblagojevic)

Hi Vladimir,

I'm also having some issues with failover. I'm using Infinispan in embedded mode. I have two servers, and when one of the nodes are running a DistributedTask and I kill the node, the task is not re-executed. Judging from the documentation it seems like the Distributed Execution Framework should provide for failover capability as long as I create a DistributedCallable and run it as a DistributedTask configured as:

taskBuilder = taskBuilder.failoverPolicy(DefaultExecutorService.RANDOM_NODE_FAILOVER);

Do I need to be Infinispan in client-server mode instead (using Hot Rod/Memcached etc...)?

Thanks,
Ryan
Actions

Go to original post