5 Replies. Latest reply: Apr 16, 2012 11:19 AM by Mitchell Ackerman

Unable to join hotrod cluster across regions

Mitchell Ackerman Newbie

I am finding it impossible to establish a hotrod cluster between certain nodes in our cluster.  We have 2 nodes in North America, between which a hotrod cluster forms without problems.  However, when we try to add a cluster member from either Asia Pacific or Europe, after a short while we get TimeoutExceptions, which then result in the new cluster member(s) being dropped.  I have also synchronized the clocks on the nodes; they all use UTC.

 

Does anyone have any suggestions as to how to get this cluster to work?

 

We see the following errors in the logs

 

 

Coordinator node (US):

 

remote node joins:

 

2012-03-14 16:04:14,581 DEBUG (OOB-2,hotrod-dev,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ___hotRodTopologyCache: Node ip-10-81-208-97-19453 is joining

2012-03-14 16:04:14,584 DEBUG (CacheViewInstaller-2,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] Installing new view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache ___hotRodTopologyCache

 

 

300,000 ms later (the configured distributed timeout value):

 

 

2012-03-14 16:09:39,084 ERROR (CacheViewInstaller-2,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache  P, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-227-56141]}

java.util.concurrent.TimeoutException

        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)

        at java.util.concurrent.FutureTask.get(FutureTask.java:91)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:322)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

        at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

        at java.util.concurrent.FutureTask.run(FutureTask.java:138)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        at java.lang.Thread.run(Thread.java:662)

2012-03-14 16:10:24,094 DEBUG (Timer-3,hotrod-dev,ip-10-81-0-227-56141) [org.jgroups.protocols.FD] sending are-you-alive msg to ip-10-81-208-97-19453 (own address=ip-10-81-0-227-56141)

2012-03-14 16:10:38,834 ERROR (CacheViewInstaller-3,ip-10-81-0-227-56141) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-227-56141, ip-10-81-208-97-19453]} for cache  R, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-227-56141]}

java.util.concurrent.TimeoutException

        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)

        at java.util.concurrent.FutureTask.get(FutureTask.java:91)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:319)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

        at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

        at java.util.concurrent.FutureTask.run(FutureTask.java:138)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        at java.lang.Thread.run(Thread.java:662)

 

 

Asia node:

 

 

org.infinispan.CacheException: Unable to invoke method private void org.infinispan.statetransfer.BaseStateTransferManagerImpl.start() throws java.lang.Exception on object

        at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:236)

        at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:875)

        at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:630)

        at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:619)

        at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:523)

        at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:173)

        at org.infinispan.CacheImpl.start(CacheImpl.java:496)

        at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:624)

        at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:514)

        at org.infinispan.server.hotrod.HotRodServer$$anonfun$preStartCaches$1.apply(HotRodServer.scala:114)

        at org.infinispan.server.hotrod.HotRodServer$$anonfun$preStartCaches$1.apply(HotRodServer.scala:112)

        at scala.collection.Iterator$class.foreach(Iterator.scala:660)

        at scala.collection.JavaConversions$JIteratorWrapper.foreach(JavaConversions.scala:573)

        at org.infinispan.server.hotrod.HotRodServer.preStartCaches(HotRodServer.scala:112)

        at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:101)

        at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:100)

        at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:95)

        at org.infinispan.server.core.Main$.boot(Main.scala:140)

        at org.infinispan.server.core.Main$$anon$1.call(Main.scala:94)

        at org.infinispan.server.core.Main$$anon$1.call(Main.scala:91)

        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)

        at java.util.concurrent.FutureTask.run(Unknown Source)

        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)

        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

        at java.lang.Thread.run(Unknown Source)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

        at java.lang.reflect.Method.invoke(Unknown Source)

        at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:234)

        ... 26 more

Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 5 minutes waiting for a response from ip-10-81-0-227-56141

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:271)

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:111)

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:447)

        at org.infinispan.cacheviews.CacheViewsManagerImpl.join(CacheViewsManagerImpl.java:214)

        at org.infinispan.statetransfer.BaseStateTransferManagerImpl.start(BaseStateTransferManagerImpl.java:139)

        ... 31 more

 

 

I have increased timeout values, but this does not seem to have any effect. 

 

 

Our configuration files are as follows:

 

 

hotrod-config.xml

 

 

<?xml version="1.0" encoding="UTF-8"?>

<infinispan

      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

      xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"

      xmlns="urn:infinispan:config:5.1">

 

 

   <global>

        <transport clusterName="hotrod-dev" distributedSyncTimeout="300000">

                <properties>

                        <property name="configurationFile" value="/opt/infinispan-5.1.1.FINAL/etc/gossip-router-config.xml"/>

                </properties>

        </transport>

      <globalJmxStatistics enabled="true"/>

   </global>

 

 

   <default>

      <jmxStatistics enabled="true"/>

      <clustering mode="R">

                <stateRetrieval timeout="300000"/>

      </clustering>

   </default>

 

 

   <namedCache name="A"/>

   <namedCache name="B"/>

   <namedCache name="P"/>

   <namedCache name="M"/>

</infinispan>

 

 

gossip-router-config.xml (I removed FD_SOCK to see if that helped, but the behaviour is the same either way)

 

 

<?xml version="1.0" encoding="UTF-8"?>

<config>

   <TCP bind_port="7900"/>

   <TCPGOSSIP timeout="3000" initial_hosts="10.81.0.227[8800]" num_initial_members="3"/>

   <MERGE2 max_interval="30000" min_interval="10000"/>

<!--   <FD_SOCK/> -->

   <FD timeout="50000" max_tries="5"/>

   <VERIFY_SUSPECT timeout="5000"/>

   <pbcast.NAKACK use_mcast_xmit="false" retransmit_timeout="300,600,1200,2400,4800" discard_delivered_msgs="true"/>

   <UNICAST timeout="300,600,1200,2400,3600"/>

   <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>

   <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>

   <UFC max_credits="2000000" min_threshold="0.10"/>

   <MFC max_credits="2000000" min_threshold="0.10"/>

   <FRAG2 frag_size="60000"/>

</config>

 

 

startup command:

 

 

/opt/infinispan-5.1.1.FINAL/bin/startServer.sh -Djava.net.preferIPv4Stack=true -Djgroups.bind_addr=10.81.208.97 --cache_config=/opt/infinispan-5.1.1.FINAL/etc/hotrod-config.xml --protocol=hotrod --host=10.81.208.97 -Dlog4j.configuration=file:///opt/infinispan-5.1.1.FINAL/etc/hotrod-log4j.xml

 

Environments are:

US nodes (note that we have had no issues mixing Ubuntu & CentOS in the US)

 

Ubuntu, running

 

java version "1.6.0_26"

Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

 

infinispan-5.1.1.FINAL

 

Asia & Europe

 

CentOS (5.7 & 6.2), running

 

java version "1.6.0_30"

Java(TM) SE Runtime Environment (build 1.6.0_30-b12)

Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)

 

infinispan-5.1.1.FINAL

 

thanks, Mitchell

  • 1. Re: Unable to join hotrod cluster across regions
    Mitchell Ackerman Newbie

    The problem appears to be that our cache size and data transfer rates are such that the synchronization takes longer than the distributed timeout value to complete.  I have tried increasing that value, but other timers kick in and terminate the sync.

     

    Does anyone know what combination of timeouts is necessary to have a distributed sync in excess of an hour?

     

    thanks, Mitchell

  • 2. Re: Unable to join hotrod cluster across regions
    Galder Zamarreño Master

    Hmmm, assuming this 1h is needed for state transfer, you'd need to adjust:

    - lock.lockAcquisitionTimeout

    - sync.replTimeout

    - stateRetrieval.timeout

     

    So, stateRetrieval.timeout > sync.replTimeout > lock.lockAcquisitionTimeout

     

    i.e. 2h > 3m > 1m
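    For instance, that ordering could look like this in the cache configuration (illustrative values only; attribute names per the Infinispan 5.1 schema):

    ```xml
    <default>
       <locking lockAcquisitionTimeout="60000"/>        <!-- 1 minute -->
       <clustering mode="replication">
          <sync replTimeout="180000"/>                  <!-- 3 minutes -->
          <stateRetrieval timeout="7200000"/>           <!-- 2 hours -->
       </clustering>
    </default>
    ```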

     

    Also, if state transfer is slow, you might want to configure a cluster cache loader instead, where data is retrieved lazily by nodes rather than having all data transferred on startup.
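    A sketch of such a loader in the 5.1 configuration might look like this (I'm quoting the ClusterCacheLoader class and property name from memory; check the loaders schema for the exact element names):

    ```xml
    <loaders>
       <!-- fetch entries lazily from other cluster members on a cache miss,
            instead of pushing the full state to a joining node -->
       <loader class="org.infinispan.loaders.cluster.ClusterCacheLoader">
          <properties>
             <property name="remoteCallTimeout" value="20000"/>
          </properties>
       </loader>
    </loaders>
    ```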

     

    Btw, we're working on a higher-level abstraction for inter-data-centre replication. See http://infinispan.blogspot.com/2012/02/cross-datacenter-replication-request.html for more info on what we're working on. I think you'd benefit from it.

  • 3. Re: Unable to join hotrod cluster across regions
    Mitchell Ackerman Newbie

    Thanks Galder. However, adding lockAcquisitionTimeout does not work:

     

    2012-03-21 17:33:06,409 ERROR (CacheViewInstaller-2,ip-10-81-0-26-1016) [org.infinispan.cacheviews.CacheViewsManagerImpl] ISPN000172: Failed to prepare view CacheView{viewId=2, members=[ip-10-81-0-26-1016, ip-10-81-208-97-36164]} for cache  A, rolling back to view CacheView{viewId=1, members=[ip-10-81-0-26-1016]}

    java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

            at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)

            at java.util.concurrent.FutureTask.get(FutureTask.java:91)

            at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterPrepareView(CacheViewsManagerImpl.java:322)

            at org.infinispan.cacheviews.CacheViewsManagerImpl.clusterInstallView(CacheViewsManagerImpl.java:250)

            at org.infinispan.cacheviews.CacheViewsManagerImpl$ViewInstallationTask.call(CacheViewsManagerImpl.java:876)

            at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

            at java.util.concurrent.FutureTask.run(FutureTask.java:138)

            at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

            at java.lang.Thread.run(Thread.java:662)

    Caused by: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

            at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)

            at java.util.concurrent.FutureTask.get(FutureTask.java:91)

            at org.infinispan.util.concurrent.AggregatingNotifyingFutureBuilder.get(AggregatingNotifyingFutureBuilder.java:93)

            at org.infinispan.statetransfer.BaseStateTransferTask.finishPushingState(BaseStateTransferTask.java:139)

            at org.infinispan.statetransfer.ReplicatedStateTransferTask.doPerformStateTransfer(ReplicatedStateTransferTask.java:116)

            at org.infinispan.statetransfer.BaseStateTransferTask.performStateTransfer(BaseStateTransferTask.java:93)

            at org.infinispan.statetransfer.BaseStateTransferManagerImpl.prepareView(BaseStateTransferManagerImpl.java:294)

            at org.infinispan.cacheviews.CacheViewsManagerImpl.handlePrepareView(CacheViewsManagerImpl.java:486)

            at org.infinispan.cacheviews.CacheViewsManagerImpl$3.call(CacheViewsManagerImpl.java:313)

            ... 5 more

    Caused by: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 minutes waiting for a response from ip-10-81-208-97-36164

            at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:271)

            at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:111)

            at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:447)

            at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)

            at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)

            at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)

            at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)

            at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)

            ... 5 more

     

    our hotrod-config now looks like:

     

    <?xml version="1.0" encoding="UTF-8"?>

    <infinispan

          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

          xsi:schemaLocation="urn:infinispan:config:5.1 http://www.infinispan.org/schemas/infinispan-config-5.1.xsd"

          xmlns="urn:infinispan:config:5.1">

     

     

       <global>

            <transport clusterName="hotrod-test" distributedSyncTimeout="16000000">

                    <properties>

                            <property name="configurationFile" value="/opt/infinispan-5.1.1.FINAL/etc/gossip-router-config.xml"/>

                    </properties>

            </transport>

          <globalJmxStatistics enabled="true"/>

       </global>

     

     

       <default>

          <jmxStatistics enabled="true"/>

          <locking

             lockAcquisitionTimeout="16000000"

          />

     

     

          <clustering mode="replicated">

     

     

             <sync replTimeout="16000000"/>

     

                     <!-- for replication -->

                     <stateRetrieval timeout="16000000"/>

             <stateTransfer timeout="16000000"/>

     

     

          </clustering>

       </default>

     

     

       <namedCache name="A"/>

       <namedCache name="B"/>

       <namedCache name="C"/>

       <namedCache name="D"/>

    </infinispan>

     

    Using a lazy cache loader would not help our situation: we are using replicated caches for redundancy (and performance), and a hotrod server could be restarted at any time and would then have to replicate the current state of the cache, which, in general, would be most of the cache data.

     

    Also, I've looked at the blog you mentioned.  Are you looking for input or use cases?

     

    thanks, Mitchell

  • 4. Re: Unable to join hotrod cluster across regions
    Galder Zamarreño Master

    Do one thing: add the following to the <clustering> element and try again:

     

    <hash rehashRpcTimeout="16000000" />
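    In context, it sits alongside the other clustering timeouts, e.g. (illustrative; element order per the 5.1 schema):

    ```xml
    <clustering mode="replicated">
       <sync replTimeout="16000000"/>
       <stateRetrieval timeout="16000000"/>
       <hash rehashRpcTimeout="16000000"/>
    </clustering>
    ```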

     

    Also, can you try with Infinispan 5.1.3.CR1? We've made some improvements to speed up state transfer.

     

    Not sure I understand why the lazy cache loader won't help you. Are you worried that if you restart nodes, data that has not yet been retrieved by the client, and therefore not transferred to the new node, would be lost?

  • 5. Re: Unable to join hotrod cluster across regions
    Mitchell Ackerman Newbie

    We have made a couple of changes to our environment and are now able to get a hotrod cluster up.  Firstly, we were able to halve our network latency.  With 5.1.1 we were still unable to get a cluster up and running; however, after building with 5.1.3.FINAL we can now do so successfully.

     

    FYI, increasing rehashRpcTimeout did not help.

     

    thanks, Mitchell