5 Replies. Latest reply: Apr 16, 2012 11:19 AM by Mitchell Ackerman

Unable to join hotrod cluster across regions

Mitchell Ackerman Newbie

I am finding it impossible to establish a hotrod cluster between certain nodes in our cluster.  We have 2 nodes in North America, between which a hotrod cluster forms without problems.  However, when we try to add a cluster member from either Asia Pacific or Europe, after a short while we get TimeoutExceptions, which then result in the new cluster member(s) being dropped.  I have also synchronized the clocks on the nodes; they all use UTC.

 

Does anyone have any suggestions as to how to get this cluster to work?

 

We see the following errors in the logs

 

 

Coordinator node (US):

 

remote node joins:

 

2012-03-14 16:04:14,581 DEBUG (OOB-2,hotrod-dev,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ___hotRodTopologyCache: Node ip-10-81-208-97-19453 is joining

2012-03-14 16:04:14,584 DEBUG (CacheViewInstaller-2,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] Installing new view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache ___hotRodTopologyCache

 

 

300,000 ms later (the configured distributed timeout value):

 

 

2012-03-14 16:09:39,084 ERROR (CacheViewInstaller-2,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache  P, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-227-56141]}

java.util.concurrent.TimeoutException

        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)

        at java.util.concurrent.FutureTask.get(FutureTask.java:91)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:322)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

        at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

        at java.util.concurrent.FutureTask.run(FutureTask.java:138)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        at java.lang.Thread.run(Thread.java:662)

2012-03-14 16:10:24,094 DEBUG (Timer-3,hotrod-dev,ip-10-81-0-227-56141) [org.jgroups.protocols.FD] sending are-you-alive msg to ip-10-81-208-97-19453 (own address=ip-10-81-0-227-56141)

2012-03-14 16:10:38,834 ERROR (CacheViewInstaller-3,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache  R, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-227-56141]}

java.util.concurrent.TimeoutException

        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)

        at java.util.concurrent.FutureTask.get(FutureTask.java:91)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

        at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

        at java.util.concurrent.FutureTask.run(FutureTask.java:138)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        at java.lang.Thread.run(Thread.java:662)

 

 

Asia node:

 

 

org.infinispan.CacheException: Unable to invoke method private void org.infinispan.statetransfer.BaseStateTransferManagerImpl.start() throws java.lang.Exception on object

        at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:236)

        at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:875)

        at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:630)

        at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:619)

        at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:523)

        at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:173)

        at org.infinispan.CacheImpl.start(CacheImpl.java:496)

        at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:624)

        at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:514)

        at org.infinispan.server.hotrod.HotRodServer$$anonfun$preStartCaches$1.apply(HotRodServer.scala:114)

        at org.infinispan.server.hotrod.HotRodServer$$anonfun$preStartCaches$1.apply(HotRodServer.scala:112)

        at scala.collection.Iterator$class.foreach(Iterator.scala:660)

        at scala.collection.JavaConversions$JIteratorWrapper.foreach(JavaConversions.scala:573)

        at org.infinispan.server.hotrod.HotRodServer.preStartCaches(HotRodServer.scala:112)

        at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:101)

        at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:100)

        at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:95)

        at org.infinispan.server.core.Main$.boot(Main.scala:140)

        at org.infinispan.server.core.Main$$anon$1.call(Main.scala:94)

        at org.infinispan.server.core.Main$$anon$1.call(Main.scala:91)

        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)

        at java.util.concurrent.FutureTask.run(Unknown Source)

        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)

        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

        at java.lang.Thread.run(Unknown Source)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

        at java.lang.reflect.Method.invoke(Unknown Source)

        at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:234)

        ... 26 more

Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 5 minutes waiting for a response from ip-10-81-0-227-56141

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:271)

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:111)

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:447)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.join(CacheViewsManagerImpl.java:214)

        at org.infinispan.statetransfer.BaseStateTransferManagerImpl.start(BaseStateTransferManagerImpl.java:139)

        ... 31 more

 

 

I have increased timeout values, but this does not seem to have any effect. 

 

 

Our configuration files are as follows:

 

 

hotrod-config.xml

 

 

<?xml version="1.0" encoding="UTF-8"?>

<infinispan

      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

      xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"

      xmlns="urn:infinispan:config:5.1">

 

 

   <global>

        <transport clusterName="hotrod-dev" distributedSyncTimeout="300000">

                <properties>

                        <property name="configurationFile" value="/opt/infinispan-5.1.1.FINAL/etc/gossip-router-config.xml"/>

                </properties>

        </transport>

      <globalJmxStatistics enabled="true"/>

   </global>

 

 

   <default>

      <jmxStatistics enabled="true"/>

      <clustering mode="R">

                <stateRetrieval timeout="300000"/>

      </clustering>

   </default>

 

 

   <namedCache name="A"/>

   <namedCache name="B"/>

   <namedCache name="P"/>

   <namedCache name="M"/>

</infinispan>

 

 

gossip-router-config.xml (I removed FD_SOCK to see if that helped, but the behaviour is the same either way)

 

 

<?xml version="1.0" encoding="UTF-8"?>

<config>

   <TCP bind_port="7900"/>

   <TCPGOSSIP timeout="3000" initial_hosts="10.81.0.227[8800]" num_initial_members="3"/>

   <MERGE2 max_interval="30000" min_interval="10000"/>

<!--   <FD_SOCK/> -->

   <FD timeout="50000" max_tries="5"/>

   <VERIFY_SUSPECT timeout="5000"/>

   <pbcast.NAKACK use_mcast_xmit="false" retransmit_timeout="300,600,1200,2400,4800" discard_delivered_msgs="true"/>

   <UNICAST timeout="300,600,1200,2400,3600"/>

   <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>

   <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>

   <UFC max_credits="2000000" min_threshold="0.10"/>

   <MFC max_credits="2000000" min_threshold="0.10"/>

   <FRAG2 frag_size="60000"/>

</config>

 

 

startup command:

 

 

/opt/infinispan-5.1.1.FINAL/bin/startServer.sh -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=10.81.208.97 --cache_config=/opt/infinispan-5.1.1.FINAL/etc/hotrod-config.xml --protocol=hotrod --host=10.81.208.97 -Dlog4j.configuration=file:///opt/infinispan-5.1.1.FINAL/etc/hotrod-log4j.xml

 

Environments are:

US nodes (note that we have had no issues mixing Ubuntu & CentOS in the US)

 

Ubuntu, running

 

java version "1.6.0_26"

Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

 

infinispan-5.1.1.FINAL

 

Asia & Europe

 

CentOS (5.7 & 6.2), running

 

java version "1.6.0_30"

Java(TM) SE Runtime Environment (build 1.6.0_30-b12)

Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)

 

infinispan-5.1.1.FINAL

 

thanks, Mitchell

  • 1. Re: Unable to join hotrod cluster across regions
    Mitchell Ackerman Newbie

    The problem appears to be that our cache size and data transfer rates are such that the synchronization takes longer than the distributed timeout value to complete.  I have tried increasing that value, but other timers kick in and terminate the sync.

     

    Does anyone know what combination of timeouts is necessary to have a distributed sync in excess of an hour?

     

    thanks, Mitchell

  • 2. Re: Unable to join hotrod cluster across regions
    Galder Zamarreño Master

    Hmmm, assuming this 1h is needed for state transfer, you'd need to adjust:

    - lock.lockAcquisitionTimeout

    - sync.replTimeout

    - stateRetrieval.timeout

     

    So, stateRetrieval.timeout > sync.replTimeout > lock.lockAcquisitionTimeout

     

    i.e. 2h > 3m > 1m
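    For instance, that ordering could look like this in the cache configuration (illustrative values only; attribute names per the Infinispan 5.1 schema):

    ```xml
    <default>
       <locking lockAcquisitionTimeout="60000"/>        <!-- 1 minute -->
       <clustering mode="replication">
          <sync replTimeout="180000"/>                  <!-- 3 minutes -->
          <stateRetrieval timeout="7200000"/>           <!-- 2 hours -->
       </clustering>
    </default>
    ```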

     

    Also, if state transfer is slow, you might want to configure a cluster cache loader instead, where data is retrieved lazily by nodes rather than having all data transferred on startup.
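    A sketch of such a loader in the 5.1 configuration might look like this (I'm quoting the ClusterCacheLoader class and property name from memory; check the loaders schema for the exact element names):

    ```xml
    <loaders>
       <!-- fetch entries lazily from other cluster members on a cache miss,
            instead of pushing the full state to a joining node -->
       <loader class="org.infinispan.loaders.cluster.ClusterCacheLoader">
          <properties>
             <property name="remoteCallTimeout" value="20000"/>
          </properties>
       </loader>
    </loaders>
    ```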

     

    Btw, we're working on a higher-level abstraction for inter-data-centre replication. See http://infinispan.blogspot.com/2012/02/cross-datacenter-replication-request.html for more info on what we're working on. I think you'd benefit from it.

  • 3. Re: Unable to join hotrod cluster across regions
    Mitchell Ackerman Newbie

    Thanks Galder. However, adding lockAcquisitionTimeout does not work:

     

    2012-03-21 17:33:06,409 ERROR (CacheViewInstaller-2,ip-10-81-0-26-1016) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-26-1016, ip-10-81-208-97-36164]} for cache  A, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-26-1016]}

    java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

            at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)

            at java.util.concurrent.FutureTask.get(FutureTask.java:91)

            at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:322)

            at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

            at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

            at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

            at java.util.concurrent.FutureTask.run(FutureTask.java:138)

            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

            at java.lang.Thread.run(Thread.java:662)

    Caused by: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

            at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)

            at java.util.concurrent.FutureTask.get(FutureTask.java:91)

            at org.infinispan.util.concurrent.AggregatingNotifyingFutureBuilder.get(AggregatingNotifyingFutureBuilder.java:93)

            at org.infinispan.statetransfer.BaseStateTransferTask.finishPushingState(BaseStateTransferTask.java:139)

            at org.infinispan.statetransfer.ReplicatedStateTransferTask.doPerformStateTransfer(ReplicatedStateTransferTask.java:116)

            at org.infinispan.statetransfer.BaseStateTransferTask.performStateTransfer(BaseStateTransferTask.java:93)

            at org.infinispan.statetransfer.BaseStateTransferManagerImpl.prepareView(BaseStateTransferManagerImpl.java:294)

            at org.infinispan.cacheviews.CacheViewsManagerImpl.handlePrepareView(CacheViewsManagerImpl.java:486)

            at org.infinispan.cacheviews.CacheViewsManagerImpl$3.call(CacheViewsManagerImpl.java:313)

            ... 5 more

    Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

            at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:271)

            at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:111)

            at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:447)

            at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)

            at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)

            at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)

            at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)

            at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)

            ... 5 more

     

    our hotrod-config now looks like:

     

    <?xml version="1.0" encoding="UTF-8"?>

    <infinispan

          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

          xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"

          xmlns="urn:infinispan:config:5.1">

     

     

       <global>

            <transport clusterName="hotrod-test" distributedSyncTimeout="16000000">

                    <properties>

                            <property name="configurationFile" value="/opt/infinispan-5.1.1.FINAL/etc/gossip-router-config.xml"/>

                    </properties>

            </transport>

          <globalJmxStatistics enabled="true"/>

       </global>

     

     

       <default>

          <jmxStatistics enabled="true"/>

          <locking

             lockAcquisitionTimeout="16000000"

          />

     

     

          <clustering mode="replicated">

     

     

             <sync replTimeout="16000000"/>

     

                     <!-- for replication -->

                     <stateRetrieval timeout="16000000"/>

             <stateTransfer timeout="16000000"/>

     

     

          </clustering>

       </default>

     

     

       <namedCache name="A"/>

       <namedCache name="B"/>

       <namedCache name="C"/>

       <namedCache name="D"/>

    </infinispan>

     

    Using a lazy cache loader would not help our situation: we are using replicated caches for redundancy (and performance), and a hotrod server could be restarted at any time and would then have to replicate the current state of the cache, which, in general, would be most of the cache data.

     

    Also, I've looked at the blog you mentioned.  Are you looking for input or use cases?

     

    thanks, Mitchell

  • 4. Re: Unable to join hotrod cluster across regions
    Galder Zamarreño Master

    Do one thing: add the following to the <clustering> element and try again:

     

    <hash rehashRpcTimeout="16000000" />
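    In context, it sits alongside the other clustering timeouts, e.g. (illustrative; element order per the 5.1 schema):

    ```xml
    <clustering mode="replicated">
       <sync replTimeout="16000000"/>
       <stateRetrieval timeout="16000000"/>
       <hash rehashRpcTimeout="16000000"/>
    </clustering>
    ```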

     

    Also, can you try with Infinispan 5.1.3.CR1? We've made some improvements to speed up state transfer.

     

    Not sure I understand why the lazy cache loader won't help you. Are you worried that if you restart nodes, data that has not yet been retrieved by the client, and therefore not transferred to the new node, would be lost?

  • 5. Re: Unable to join hotrod cluster across regions
    Mitchell Ackerman Newbie

    We have made a couple of changes to our environment and are now able to get a hotrod cluster up.  Firstly, we were able to halve our network latency.  With 5.1.1 we were still unable to get a cluster up and running; however, after building with 5.1.3.FINAL we can now do so successfully.

     

    FYI, increasing rehashRpcTimeout did not help.

     

    thanks, Mitchell