5 Replies Latest reply on Apr 16, 2012 11:19 AM by mackerman

    Unable to join hotrod cluster across regions

    mackerman

      I am finding it impossible to establish a hotrod cluster between certain nodes in our cluster.  We have 2 nodes in N America, between which we have no problems establishing a hotrod cluster.  However, when we try to add a cluster member from either Asia Pacific or Europe, after a short while we get TimeoutExceptions, which then result in the new cluster member(s) being dropped.  I have also synchronized the clocks on the nodes; they all use UTC.

       

      Does anyone have any suggestions as to how to get this cluster to work?

       

      We see the following errors in the logs:

       

       

      Coordinator node (US):

       

      remote node joins:

       

      2012-03-14 16:04:14,581 DEBUG (OOB-2,hotrod-dev,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ___hotRodTopologyCache: Node ip-10-81-208-97-19453 is joining

      2012-03-14 16:04:14,584 DEBUG (CacheViewInstaller-2,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] Installing new view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache ___hotRodTopologyCache

       

       

      About 300000 milliseconds later (the configured distributedSyncTimeout value):

       

       

      2012-03-14 16:09:39,084 ERROR (CacheViewInstaller-2,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache  P, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-227-56141]}

      java.util.concurrent.TimeoutException

              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)

              at java.util.concurrent.FutureTask.get(FutureTask.java:91)

              at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:322)

              at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

              at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:662)

      2012-03-14 16:10:24,094 DEBUG (Timer-3,hotrod-dev,ip-10-81-0-227-56141) [org.jgroups.protocols.FD] sending are-you-alive msg to ip-10-81-208-97-19453 (own address=ip-10-81-0-227-56141)

      2012-03-14 16:10:38,834 ERROR (CacheViewInstaller-3,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache  R, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-227-56141]}

      java.util.concurrent.TimeoutException

              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)

              at java.util.concurrent.FutureTask.get(FutureTask.java:91)

              at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)

              at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

              at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:662)

       

       

      Asia node:

       

       

      org.infinispan.CacheException: Unable to invoke method private void org.infinispan.statetransfer.BaseStateTransferManagerImpl.start() throws java.lang.Exception on object

              at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:236)

              at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:875)

              at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:630)

              at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:619)

              at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:523)

              at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:173)

              at org.infinispan.CacheImpl.start(CacheImpl.java:496)

              at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:624)

              at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:514)

              at org.infinispan.server.hotrod.HotRodServer$$anonfun$preStartCaches$1.apply(HotRodServer.scala:114)

              at org.infinispan.server.hotrod.HotRodServer$$anonfun$preStartCaches$1.apply(HotRodServer.scala:112)

              at scala.collection.Iterator$class.foreach(Iterator.scala:660)

              at scala.collection.JavaConversions$JIteratorWrapper.foreach(JavaConversions.scala:573)

              at org.infinispan.server.hotrod.HotRodServer.preStartCaches(HotRodServer.scala:112)

              at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:101)

              at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:100)

              at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:95)

              at org.infinispan.server.core.Main$.boot(Main.scala:140)

              at org.infinispan.server.core.Main$$anon$1.call(Main.scala:94)

              at org.infinispan.server.core.Main$$anon$1.call(Main.scala:91)

              at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)

              at java.util.concurrent.FutureTask.run(Unknown Source)

              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)

              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

              at java.lang.Thread.run(Unknown Source)

      Caused by: java.lang.reflect.InvocationTargetException

              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

              at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

              at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

              at java.lang.reflect.Method.invoke(Unknown Source)

              at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:234)

              ... 26 more

      Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 5 minutes waiting for a response from ip-10-81-0-227-56141

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:271)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:111)

              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:447)

              at org.infinispan.cacheviews.CacheViewsManagerImpl.join(CacheViewsManagerImpl.java:214)

              at org.infinispan.statetransfer.BaseStateTransferManagerImpl.start(BaseStateTransferManagerImpl.java:139)

              ... 31 more

       

       

      I have increased timeout values, but this does not seem to have any effect. 

       

       

      Our configuration files are as follows:

       

       

      hotrod-config.xml

       

       

      <?xml version="1.0" encoding="UTF-8"?>

      <infinispan

            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

            xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"

            xmlns="urn:infinispan:config:5.1">

       

       

         <global>

              <transport clusterName="hotrod-dev" distributedSyncTimeout="300000">

                      <properties>

                              <property name="configurationFile" value="/opt/infinispan-5.1.1.FINAL/etc/gossip-router-config.xml"/>

                      </properties>

              </transport>

            <globalJmxStatistics enabled="true"/>

         </global>

       

       

         <default>

            <jmxStatistics enabled="true"/>

            <clustering mode="R">

                      <stateRetrieval timeout="300000"/>

            </clustering>

         </default>

       

       

         <namedCache name="A"/>

         <namedCache name="B"/>

         <namedCache name="P"/>

         <namedCache name="M"/>

      </infinispan>

       

       

      gossip-router-config.xml (I removed FD_SOCK to see if that helped, but the behaviour is the same either way)

       

       

      <?xml version="1.0" encoding="UTF-8"?>

      <config>

         <TCP bind_port="7900"/>

         <TCPGOSSIP timeout="3000" initial_hosts="10.81.0.227[8800]" num_initial_members="3"/>

         <MERGE2 max_interval="30000" min_interval="10000"/>

      <!--   <FD_SOCK/> -->

         <FD timeout="50000" max_tries="5"/>

         <VERIFY_SUSPECT timeout="5000"/>

         <pbcast.NAKACK use_mcast_xmit="false" retransmit_timeout="300,600,1200,2400,4800" discard_delivered_msgs="true"/>

         <UNICAST timeout="300,600,1200,2400,3600"/>

         <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>

         <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>

         <UFC max_credits="2000000" min_threshold="0.10"/>

         <MFC max_credits="2000000" min_threshold="0.10"/>

         <FRAG2 frag_size="60000"/>

      </config>

       

       

      startup command:

       

       

      /opt/infinispan-5.1.1.FINAL/bin/startServer.sh -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=10.81.208.97 --cache_config=/opt/infinispan-5.1.1.FINAL/etc/hotrod-config.xml --protocol=hotrod --host=10.81.208.97 -Dlog4j.configuration=file:///opt/infinispan-5.1.1.FINAL/etc/hotrod-log4j.xml

       

      Environments are:

      US nodes (note that we have had no issues mixing Ubuntu & CentOS in the US)

       

      Ubuntu, running

       

      java version "1.6.0_26"

      Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

      Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

       

      infinispan-5.1.1.FINAL

       

      Asia & Europe

       

      CentOS (5.7 & 6.2), running

       

      java version "1.6.0_30"

      Java(TM) SE Runtime Environment (build 1.6.0_30-b12)

      Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)

       

      infinispan-5.1.1.FINAL

       

      thanks, Mitchell

        • 1. Re: Unable to join hotrod cluster across regions
          mackerman

          The problem appears to be that our cache size and data transfer rates are such that the synchronization takes longer than the distributed timeout values allow.  I have tried increasing that value, but other timers kick in and terminate the sync.

           

          Does anyone know what combination of timeouts is necessary to have a distributed sync in excess of an hour?

           

          thanks, Mitchell

          • 2. Re: Unable to join hotrod cluster across regions
            galder.zamarreno

            Hmmm, assuming this 1h is needed for state transfer, you'd need to adjust:

            - locking.lockAcquisitionTimeout

            - sync.replTimeout

            - stateRetrieval.timeout

             

            So, stateRetrieval.timeout > sync.replTimeout > locking.lockAcquisitionTimeout

             

            i.e. 2h > 3m > 1m
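
            As a rough sketch, that ordering could be written in the 5.1 XML format used elsewhere in this thread as shown below. The millisecond values are only illustrative conversions of 2h, 3m and 1m, not tested recommendations, and mode="replication" is assumed here for a replicated cache.

            ```xml
            <default>
               <!-- smallest of the three: 1 minute (illustrative value) -->
               <locking lockAcquisitionTimeout="60000"/>
               <clustering mode="replication">
                  <!-- in between: 3 minutes (illustrative value) -->
                  <sync replTimeout="180000"/>
                  <!-- largest: 2 hours, to cover a slow state transfer (illustrative value) -->
                  <stateRetrieval timeout="7200000"/>
               </clustering>
            </default>
            ```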

             

            Also, if state transfer is slow, you might wanna configure a cluster cache loader instead, where data is retrieved lazily by nodes rather than having all data transferred on startup.
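
            For reference, a cluster cache loader in the 5.1 XML format might look roughly like the fragment below. This is a sketch from memory of the 5.1 schema, not a tested configuration: the remoteCallTimeout value is arbitrary, and switching off fetchInMemoryState is an assumption about how the up-front transfer would be skipped. Check both against the 5.1 configuration reference before using them.

            ```xml
            <default>
               <clustering mode="replication">
                  <!-- assumption: skip the bulk in-memory state transfer on join -->
                  <stateRetrieval fetchInMemoryState="false"/>
               </clustering>
               <loaders>
                  <!-- on a local miss, ask the other cluster members for the entry -->
                  <loader class="org.infinispan.loaders.cluster.ClusterCacheLoader">
                     <properties>
                        <!-- arbitrary example value, in milliseconds -->
                        <property name="remoteCallTimeout" value="20000"/>
                     </properties>
                  </loader>
               </loaders>
            </default>
            ```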

             

            Btw, we're working on a higher abstraction for inter data-centre replication. See http://infinispan.blogspot.com/2012/02/cross-datacenter-replication-request.html for more info on what we're working on. I think you'd benefit from it.

            • 3. Re: Unable to join hotrod cluster across regions
              mackerman

              Thanks Galder. However, adding lockAcquisitionTimeout does not work:

               

              2012-03-21 17:33:06,409 ERROR (CacheViewInstaller-2,ip-10-81-0-26-1016) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-26-1016, ip-10-81-208-97-36164]} for cache  A, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-26-1016]}

              java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

                      at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)

                      at java.util.concurrent.FutureTask.get(FutureTask.java:91)

                      at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:322)

                      at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

                      at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

                      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

                      at java.util.concurrent.FutureTask.run(FutureTask.java:138)

                      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

                      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

                      at java.lang.Thread.run(Thread.java:662)

              Caused by: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

                      at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)

                      at java.util.concurrent.FutureTask.get(FutureTask.java:91)

                      at org.infinispan.util.concurrent.AggregatingNotifyingFutureBuilder.get(AggregatingNotifyingFutureBuilder.java:93)

                      at org.infinispan.statetransfer.BaseStateTransferTask.finishPushingState(BaseStateTransferTask.java:139)

                      at org.infinispan.statetransfer.ReplicatedStateTransferTask.doPerformStateTransfer(ReplicatedStateTransferTask.java:116)

                      at org.infinispan.statetransfer.BaseStateTransferTask.performStateTransfer(BaseStateTransferTask.java:93)

                      at org.infinispan.statetransfer.BaseStateTransferManagerImpl.prepareView(BaseStateTransferManagerImpl.java:294)

                      at org.infinispan.cacheviews.CacheViewsManagerImpl.handlePrepareView(CacheViewsManagerImpl.java:486)

                      at org.infinispan.cacheviews.CacheViewsManagerImpl$3.call(CacheViewsManagerImpl.java:313)

                      ... 5 more

              Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

                      at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:271)

                      at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:111)

                      at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:447)

                      at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)

                      at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)

                      at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)

                      at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)

                      at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)

                      ... 5 more

               

              Our hotrod-config now looks like:

               

              <?xml version="1.0" encoding="UTF-8"?>

              <infinispan

                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

                    xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"

                    xmlns="urn:infinispan:config:5.1">

               

               

                 <global>

                      <transport clusterName="hotrod-test" distributedSyncTimeout="16000000">

                              <properties>

                                      <property name="configurationFile" value="/opt/infinispan-5.1.1.FINAL/etc/gossip-router-config.xml"/>

                              </properties>

                      </transport>

                    <globalJmxStatistics enabled="true"/>

                 </global>

               

               

                 <default>

                    <jmxStatistics enabled="true"/>

                    <locking

                       lockAcquisitionTimeout="16000000"

                    />

               

               

                    <clustering mode="replicated">

               

               

                       <sync replTimeout="16000000"/>

               

                               <!-- for replication -->

                               <stateRetrieval timeout="16000000"/>

                       <stateTransfer timeout="16000000"/>

               

               

                    </clustering>

                 </default>

               

               

                 <namedCache name="A"/>

                 <namedCache name="B"/>

                 <namedCache name="C"/>

                 <namedCache name="D"/>

              </infinispan>

               

              Using a lazy cache loader would not help our situation, as we are using replicated caches for redundancy (& performance), where a hotrod server could be restarted at any time and would have to replicate the current state of the cache, which, in general, would be most of the cache data.

               

              Also, I've looked at the blog you mentioned.  Are you looking for input or use cases?

               

              thanks, Mitchell

              • 4. Re: Unable to join hotrod cluster across regions
                galder.zamarreno

                Do one thing: add the following to the <clustering> element and try again:

                 

                <hash rehashRpcTimeout="16000000" />

                 

                Also, can you try with Infinispan 5.1.3.CR1? We've made some improvements to speed up state transfer.

                 

                Not sure I understand why the lazy cache loader won't help you. Are you worried about losing data if you restart nodes, because data that has not been retrieved by the client has not been transferred to the new node and is therefore lost?

                • 5. Re: Unable to join hotrod cluster across regions
                  mackerman

                  We have made a couple of changes to our environment, and are now able to get a hotrod cluster up.  Firstly, we were able to halve our network latency.  With 5.1.1 we were still not able to get a cluster up and running; however, after building with 5.1.3.FINAL we can now successfully do so.

                   

                  FYI, increasing rehashRpcTimeout did not help.

                   

                  thanks, Mitchell