8 Replies. Latest reply: Mar 14, 2013 7:08 PM by Vladimir Blagojevic

Distributed Task Failover on Node Failure

Ovidiu Feodorov Master

According to the current documentation (https://docs.jboss.org/author/display/ISPN/Infinispan+Distributed+Execution+Framework#InfinispanDistributedExecutionFramework-Distributedtaskfailoverandmigration), an Infinispan cluster should detect a node failure and migrate a distributed task currently running on that node to the next suitable node.

 

I have tried simulating this scenario with 5.1.2.FINAL (and I have reasons to suspect 5.1.4.FINAL behaves similarly):

 

1) a three-node cluster (A-***, B-*** and C-***)

2) a distributed task submitted from A-*** to run in parallel on all nodes, using submitEverywhere(distributedCallable) with no input keys

3) killing node B-*** (not the one that initiated the callable) while the task was running on it.

 

The node failure has been detected by the cluster, which performed a view change, but instead of the expected result (three futures that return valid results, albeit one not computed on the node that died, but on a backup node), I have seen:

 

> got response from C-17379

 

Exception in thread "main" java.util.concurrent.ExecutionException: org.infinispan.CacheException: SuspectedException
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.infinispan.distexec.DefaultExecutorService$DistributedRunnableFuture.get(DefaultExecutorService.java:557)
        at com.novaordis.playground.infinispan.command.LaunchDistributedCallable.execute(LaunchDistributedCallable.java:117)
        at com.novaordis.playground.infinispan.Main.readCommandsFromCommandLineAndPassThemToNode(Main.java:78)
        at com.novaordis.playground.infinispan.Main.main(Main.java:39)
Caused by: org.infinispan.CacheException: SuspectedException
        at org.infinispan.util.Util.rewrapAsCacheException(Util.java:524)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:168)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:478)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)
        at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)
        at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: SuspectedException
        at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:349)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:263)
        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:163)
        ... 11 more
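To make the expectation concrete, here is a minimal sketch of the behaviour I was hoping for, written against java.util.concurrent only. It does not use Infinispan's API and says nothing about how Infinispan implements failover; Node, submitEverywhereWithFailover and retryOnBackup are hypothetical names invented for this illustration, and "next node in the list" is an assumed stand-in for "next suitable node".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class FailoverSketch {

    /** Hypothetical stand-in for a cluster node; not an Infinispan type. */
    static final class Node {
        final String name;
        final boolean alive;

        Node(String name, boolean alive) {
            this.name = name;
            this.alive = alive;
        }

        /** Runs the task locally, or fails the way a killed node would. */
        <T> T run(Function<String, T> task) {
            if (!alive) {
                throw new IllegalStateException("node " + name + " is suspected");
            }
            return task.apply(name);
        }
    }

    /** Submits the task to every node; a failed slot is retried on a backup node. */
    static <T> List<T> submitEverywhereWithFailover(List<Node> nodes, Function<String, T> task)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Node node : nodes) {
                Callable<T> onNode = () -> node.run(task);
                futures.add(pool.submit(onNode));
            }
            List<T> results = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    results.add(futures.get(i).get());
                } catch (ExecutionException e) {
                    // Instead of surfacing the failure to the caller, migrate the task.
                    results.add(retryOnBackup(nodes, i, task, e));
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** Tries the remaining nodes in list order; rethrows the original failure if all fail. */
    static <T> T retryOnBackup(List<Node> nodes, int failed, Function<String, T> task,
                               ExecutionException original) throws ExecutionException {
        for (int step = 1; step < nodes.size(); step++) {
            Node backup = nodes.get((failed + step) % nodes.size());
            try {
                return backup.run(task);
            } catch (RuntimeException ignored) {
                // This backup is unavailable too; try the next one.
            }
        }
        throw original;
    }

    public static void main(String[] args) throws Exception {
        List<Node> nodes = Arrays.asList(
                new Node("A", true), new Node("B", false), new Node("C", true));
        System.out.println(submitEverywhereWithFailover(nodes, name -> "computed on " + name));
    }
}
```

With B marked as dead, the caller still gets three results, and the slot that belonged to B is filled by C. That is the outcome I expected from the cluster, rather than an ExecutionException from Future.get().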

 

 

Is my understanding of the distributed task migration mechanism correct, and are my expectations valid?

 

If not, could you please point me in the right direction? What exactly does "task migration" mean, and what result is expected for the scenario presented above?

 

If yes, is this a feature that is not implemented yet (as the documentation seems to suggest)?

 

I have a command-line testing tool that makes simulating these scenarios easy, and I will be delighted to share it with the dev team if they believe this case is worth investigating and getting to the bottom of.

 

This thread is related to https://community.jboss.org/message/731545; I just took failure detection out of the picture, since failure detection works fine with proper tuning.

 

Thanks,
Ovidiu