9 Replies Latest reply on Nov 26, 2014 9:09 PM by rtom

    Distributed Task Failover on Node Failure

    ovidiu.feodorov

      According to the current documentation (https://docs.jboss.org/author/display/ISPN/Infinispan+Distributed+Execution+Framework#InfinispanDistributedExecutionFramework-Distributedtaskfailoverandmigration), an Infinispan cluster should detect a node failure and migrate a distributed task currently running on that node on the next suitable node.

       

      I have tried simulating this scenario with 5.1.2.FINAL (and I have reasons to suspect 5.1.4.FINAL behaves similarly):

       

      1) three node cluster (A-***, B-*** and C-***)

      2) a distributed task submitted in parallel on all nodes with submitEverywhere(distributedCallable) and no input keys from A-***

      3) killing the node B-*** - not the one that initiated the callable - while the task was running on it.

       

      The node failure has been detected by the cluster, which performed a view change, but instead of the expected result (three futures that return valid results, albeit one not computed on the node that died, but on a backup node), I have seen:

       

      > got response from C-17379

       

      Exception in thread "main" java.util.concurrent.ExecutionException: org.infinispan.CacheException: SuspectedException

              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)

              at java.util.concurrent.FutureTask.get(FutureTask.java:83)

              at org.infinispan.distexec.DefaultExecutorService$DistributedRunnableFuture.get(DefaultExecutorService.java:557)

              at com.novaordis.playground.infinispan.command.LaunchDistributedCallable.execute(LaunchDistributedCallable.java:117)

              at com.novaordis.playground.infinispan.Main.readCommandsFromCommandLineAndPassThemToNode(Main.java:78)

              at com.novaordis.playground.infinispan.Main.main(Main.java:39)

      Caused by: org.infinispan.CacheException: SuspectedException

              at org.infinispan.util.Util.rewrapAsCacheException(Util.java:524)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:168)

              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:478)

              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)

              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)

              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)

              at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)

              at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:662)

      Caused by: SuspectedException

              at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:349)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:263)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:163)

              ... 11 more

       

       

      It is my understanding of the distributed task migration mechanism correct, and my expectations valid?

       

      If no, could you please point me to the right direction? What exactly does "task migration" mean and what result is expected for the scenario presented above?

       

      If yes, is this a feature not implemented yet (as the documentation seem to suggest?)

       

      I have a command line testing tool that makes simulating all these various scenarios easy, and I will be delighted to share it with the dev team, if they believe this case is worth investigating and get to the bottom to.

       

      This thread is related to https://community.jboss.org/message/731545, I just took the failure detection out of the picture; failure detection works fine with proper tunning.

       

      Thanks,
      Ovidiu

        • 1. Re: Distributed Task Failover on Node Failure
          vblagojevic

          Ovidiu, task failover is not implemented yet! It is planned for 6.0; as soon as failover SPI is done I'll ping you for review!

           

          Regards,

          Vladimir

          • 2. Re: Distributed Task Failover on Node Failure
            ovidiu.feodorov

            Thanks, I am staying tuned.

            • 3. Re: Distributed Task Failover on Node Failure
              lazetics

              Hi Vladimir,

               

              Does version 5.2.1.Final support distributed task failover and custom implementations of DistributedTaskFailoverPolicy? In documentations for this version it says that it does, also I see it in the API, still I cannot make it working.

               

              Thanks,

              Strahinja

              • 4. Re: Distributed Task Failover on Node Failure
                vblagojevic

                Strahinja,

                 

                It should work. We have unit tests that specifically test this feature. What are you observing?

                 

                Regards,

                Vladimir

                • 5. Re: Distributed Task Failover on Node Failure
                  lazetics

                  Hi Vladimir,

                   

                  Thanks for the quick reply. I tested it only on one machine, starting several instances of cache and creating the cluster. When I start my DistributedCallable in one process it is sometimes migrated to the another process, so I believe the cluster is working. Still, when I kill the process where the DistributedCallable is executing, it does not start it again on another process. I tried both with RANDOM_NODE_FAILOVER and my custom Failover policy.

                  Is this supposed to work like I just described or I am missing something?

                   

                  Thanks,

                  Strahinja

                  • 6. Re: Distributed Task Failover on Node Failure
                    vblagojevic

                    Strahinja,

                     

                    It should work, are you sure node is being killed while DistributedCallable is executing? If so, and it still does not work, open a JIRA report with your example code! 

                     

                    Regards,

                    Vladimir

                    • 7. Re: Distributed Task Failover on Node Failure
                      lazetics

                      Hi Vladimir,

                       

                      Here are my updated observations. It looks like when the node which actually starts a DistributedTask is killed (i.e. the one which waits on "future.get()"), the task cannot be re-executed, whether the task was running on that node or on some other node which was killed later on. Did you test this scenario?

                       

                      Thanks,

                      Strahinja

                      • 8. Re: Distributed Task Failover on Node Failure
                        vblagojevic

                        Strahinja,

                         

                        This is not a supported feature. If master node dies, task is lost. I am not even sure how we would support master task node failover and a task reference recovery in a non-convoluted way.

                         

                        Regards,

                        Vladimir

                        • 9. Re: Distributed Task Failover on Node Failure
                          rtom

                          Hi Vladimir,

                           

                          I'm also having some issues with failover. I'm using Infinispan in embedded mode. I have two servers, and when one of the nodes are running a DistributedTask and I kill the node, the task is not re-executed. Judging from the documentation it seems like the Distributed Execution Framework should provide for failover capability as long as I create a DistributedCallable and run it as a DistributedTask configured as:

                           

                          taskBuilder = taskBuilder.failoverPolicy(DefaultExecutorService.RANDOM_NODE_FAILOVER);

                           

                          Do I need to be Infinispan in client-server mode instead (using Hot Rod/Memcached etc...)?

                           

                          Thanks,

                          Ryan