23 Replies · Latest reply: Mar 9, 2012 5:00 AM by Dan Berindei

StateTransferException with infinispan 5.0.0FINAL

Dmitry Shohov Newbie

Hi!

     I'm getting a StateTransferException with 5+ nodes in a cluster on 3 different computers:

This one is on the starting instance:

 

2011-09-02 10:11:20,822 - INFO  - GlobalComponentRegistry        - ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.0.FINAL
2011-09-02 10:11:21,083 - INFO  - RpcManagerImpl                 - ISPN000074: Trying to fetch state from SERVER-1-43204
2011-09-02 10:11:23,093 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:11:53,122 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:11:53,241 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:12:23,265 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:12:23,282 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:12:53,306 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:12:53,324 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:13:53,392 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:13:53,406 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:14:23,432 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:14:23,448 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:14:53,474 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:14:53,486 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:15:21,554 - ERROR - JGroupsTransport               - ISPN000096: Caught while requesting or applying state
org.infinispan.statetransfer.StateTransferException: java.io.EOFException: The stream ended unexpectedly.  Please check whether the source of the stream encountered any issues generating the stream.
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:292)
        at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:235)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:607)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:711)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
        at org.jgroups.JChannel.up(JChannel.java:1441)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:523)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:462)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:223)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:189)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:908)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246)
        at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:613)
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:703)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133)
        at org.jgroups.protocols.FD.up(FD.java:275)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275)
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:209)
        at org.jgroups.protocols.Discovery.up(Discovery.java:293)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1109)
        at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1665)
        at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1647)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException: The stream ended unexpectedly.  Please check whether the source of the stream encountered any issues generating the stream.
        at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:193)
        at org.infinispan.statetransfer.StateTransferManagerImpl.processCommitLog(StateTransferManagerImpl.java:215)
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyTransactionLog(StateTransferManagerImpl.java:237)
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:281)
        ... 28 more
Caused by: java.io.EOFException: Read past end of file
        at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
        at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
        at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:209)
        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:37)
        at org.infinispan.marshall.jboss.GenericJBossMarshaller.objectFromObjectStream(GenericJBossMarshaller.java:194)
        at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:191)
        ... 31 more

 

And these are from the instance that provides the state:

 

2011-09-06 12:07:25,725 -- ERROR [::] - RpcManagerImpl - ISPN000073: Unexpected error while replicating
java.util.concurrent.TimeoutException: Timed out waiting for a cluster-wide sync to be released. (timeout = 240 seconds)
        at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.blockUntilReleased(JGroupsDistSync.java:78)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:436)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:132)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:156)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:265)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:252)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:235)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:228)
        at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:116)
        at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:79)
...

 

org.infinispan.CacheException: java.util.concurrent.TimeoutException: Timed out waiting for a cluster-wide sync to be released. (timeout = 240 seconds)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:145)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:156)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:265)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:252)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:235)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:228)
        at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:116)
        at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:79)
        at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:119)

 

2011-09-06 12:18:16,295 -- DEBUG - JGroupsDistSync - Failed to acquire exclusive processing lock. Read lock holders are [Thread[RetryQueueProcessor-sessions-cache@DFS1-27866,10,Thread Pools], Thread[RetryQueueProcessor-api-reports-cache@DFS1-27866,10,Thread Pools], Thread[Timer-2,5,main]]
2011-09-06 12:18:16,297 -- ERROR - JGroupsTransport - ISPN000095: Caught while responding to state transfer request
org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-10,Infinispan-Cluster,DFS1-27866 could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@19977e8[Read locks = 3] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@8f243d[Unlocked]
        at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:162)
        at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:253)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:590)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:690)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
        at org.jgroups.JChannel.up(JChannel.java:1484)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:651)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:580)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-10,Infinispan-Cluster,DFS1-27866 could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@19977e8[Read locks = 3] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@8f243d[Unlocked]
        at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:104)
        at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:189)
        at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:152)
        ... 11 more
2011-09-06 12:18:16,303 -- DEBUG - STREAMING_STATE_TRANSFER - State writer is closing the socket

 

There was no load at all; all caches had maybe 5 objects in them.

Version 4.2.1.FINAL works without any problems with at least 10 nodes in the cluster.
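
For what it's worth, the exclusive-lock timeout above looks like plain read/write-lock contention: the state-transfer sender needs the JGroupsDistSync write lock while other threads (the RetryQueueProcessor threads in the DEBUG line) still hold read locks. A minimal JDK-only sketch of that pattern, for illustration only (this is not Infinispan code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ExclusiveLockTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Reader threads (like the RetryQueueProcessor threads in the log)
        // hold the read lock while they work.
        lock.readLock().lock();

        // The state-transfer sender needs the write lock exclusively.
        // While any read lock is held (ReentrantReadWriteLock does not
        // support lock upgrade either), a timed tryLock can only time out.
        boolean acquired = lock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println("write lock acquired: " + acquired); // false

        lock.readLock().unlock();

        // Once every reader has released, the write lock is available.
        acquired = lock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println("write lock acquired: " + acquired); // true
    }
}
```

In our case the readers apparently never released within the 240-second timeout, so the provider gave up and closed the state socket, which is what the receiver then saw as the EOFException.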

  • 1. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    Hmm, we've revamped state transfer for 5.1. There's 5.1.0.ALPHA2 available with the new state transfer, but I'd probably wait for BETA1, which will be released on Monday/Tuesday, because it's been further consolidated.

  • 2. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    Nice, thanks. Will try BETA1 when it's out

  • 4. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    Tried with 5.1.0.BETA1. State transfer works. But after that I started to get a lot of

    TCP - logical address cache didn't contain all physical address, sending up a discovery request

    on one node, and a lot of

    TCP - S1-60192: no physical address for 204844ea-5a4e-286f-4b75-23dceb10b7a9, dropping message

    RequestCorrelator - suspect=204844ea-5a4e-286f-4b75-23dceb10b7a9

    on another node

  • 5. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    Dmitry, I don't know what that's about, but it seems to come from JGroups. Any logs available? If you can enable TRACE logging on org.jgroups, it'd be even better.

  • 6. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    It is somehow related to the JGroups configuration. I use TCPPING with two servers, A and B, in the initial hosts. These servers see each other, and I assume that one of them is always running. I have another two servers, C and D, which have the same TCPPING initial hosts (A and B), so they can find each other only through A or B. With JGroups 2.12.1.Final this configuration worked: C and D received information about each other from A or B, and the whole cluster worked. Now C receives a UUID (204844ea-5a4e-286f-4b75-23dceb10b7a9) instead of the correct name of D and complains that it has no physical address for it. It sends discovery requests to A and B and is still not able to find D. The same thing happens on D with respect to C.

     

    If I put all the servers in TCPPING.initial_hosts, then everything works fine.
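
    For reference, the working variant looks roughly like this in the JGroups stack (host names and ports here are placeholders, not my real config):

```xml
<TCP bind_port="7800" />
<TCPPING timeout="3000"
         initial_hosts="A[7800],B[7800],C[7800],D[7800]"
         port_range="1"
         num_initial_members="4" />
```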

     

    Here are the logs with TRACE on org.jgroups if you want to take a look: http://dl.dropbox.com/u/16787249/investigations.zip

  • 7. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    It could be an issue with JGroups 3.0, which is the version used by the latest Infinispan. I'll raise the issue with the JGroups guys.

  • 8. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    I'd like a link to the issue if it's public.

  • 9. Re: StateTransferException with infinispan 5.0.0FINAL
    Bela Ban Master

    Put *all* of the cluster servers in TCPPING.initial_hosts, as recommended by the documentation.

     

    TCPPING is used for static clusters, in which the cluster membership is fixed and known beforehand. There are other, dynamic discovery protocols, such as TCPGOSSIP, FILE_PING, MPING, etc., to accommodate dynamic cluster membership.
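
    As a sketch, swapping static discovery for FILE_PING over a shared directory would look something like the fragment below (the path is a placeholder):

```xml
<TCP bind_port="7800" />
<!-- Each node writes its own address under `location`; joining nodes
     read the directory instead of needing a fixed initial_hosts list. -->
<FILE_PING location="/shared/jgroups-discovery"
           timeout="3000"
           num_initial_members="2" />
```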

  • 10. Re: StateTransferException with infinispan 5.0.0FINAL
    craig bomba Novice

    We have more recently been using 5.0.0.FINAL. It had been doing well for us until last week, when we hit the same issue described in this thread. I can see the suggestion was to move to a newer release candidate at the time of the discussion (about 2 months ago). I have a few questions/concerns I'd like feedback on from Galder.

     

    1)  Do you know what the issue was (which source file within the state transfer) that may have caused the exception?

    2)  As in Dmitry's case, this occurred for us when adding a second embedded cache into the cluster (we only have 2, though; I think he mentioned having 10 in the cluster).

    3)  It seems that once the exception occurs on the 2nd VM, the Scheduled-ReplicationQueueThread gets messed up on the coordinator. It gets an ISPN000073 error with "Unexpected error while replicating". Is this an indication that the cluster joiner has created the issue (even for the coordinator), or is it the coordinator that caused both?

    4)  Which release candidate should be reliable for us, assuming we don't want to just take the latest, since that might carry a significant risk of introducing new bugs?

     

    I have updated Joel Tosi on this issue as well, in case you get feedback from that direction.

     

     

    Thanks,

    Craig

  • 11. Re: StateTransferException with infinispan 5.0.0FINAL
    Dan Berindei Apprentice

    craig bomba wrote:

     

    We have more recently been using 5.0.0.FINAL. It had been doing well for us until last week, when we hit the same issue described in this thread. I can see the suggestion was to move to a newer release candidate at the time of the discussion (about 2 months ago). I have a few questions/concerns I'd like feedback on from Galder.

     

    1)  Do you know what the issue was (which source file within the state transfer) that may have caused the exception?

     

     

    The StateTransferManager on the first node tried to acquire a cluster-wide lock (JGroupsDistSync) before writing the commit log at the end of state transfer, and it couldn't acquire it (maybe because of a deadlock with other transactions running at the time on the first node).

     

    We haven't investigated it too much because in 5.1.0 we're already using a different state transfer mechanism that doesn't acquire a cluster-wide lock.

     

    2)  As in Dmitry's case, this occurred for us when adding a second embedded cache into the cluster (we only have 2, though; I think he mentioned having 10 in the cluster).

    3)  It seems that once the exception occurs on the 2nd VM, the Scheduled-ReplicationQueueThread gets messed up on the coordinator. It gets an ISPN000073 error with "Unexpected error while replicating". Is this an indication that the cluster joiner has created the issue (even for the coordinator), or is it the coordinator that caused both?

     

    It's most likely the coordinator; the joiner doesn't have any transactions running yet, so it's unlikely to cause a deadlock on the cluster-wide lock.

     

    4)  Which release candidate should be reliable for us assuming we don't want to just take the the latest since that might cause a significant risk if introducing new bugs?

     

    5.0.1.FINAL didn't change anything in the replicated-mode state transfer, so it should be the same as 5.0.0.FINAL for you.

    5.1.0.BETA1 changed the state transfer to use the same mechanism as rehashing in 5.0 (ISPN-1194), and we've been making it more stable ever since. So I would advise going with 5.1.0.CR1.

  • 12. Re: StateTransferException with infinispan 5.0.0FINAL
    craig bomba Novice

    I appreciate the quick reply. 

     

    In fact we found out (consistent with your description) that 5.0.1.FINAL also has this issue.

     

    At this point it would be risky for us to go forward to 5.1.0.CR1. We will fall back to using 4.1.0.FINAL in the short term and move to 5.1 in early 2012.

     

    Thanks again.

  • 13. Re: StateTransferException with infinispan 5.0.0FINAL
    Douglas Ferguson Newbie

    I'm curious if anybody has found a workaround.

     

    We are experiencing the same issue.

     

    We tried to update to 5.1, but we get NoClassDefFound errors because we are also using JGroups for Ehcache.

  • 14. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    What are you using Ehcache for? Btw, since Ehcache got acquired by Terracotta, I don't think there's much work going into their JGroups compatibility layer; instead, if you want to cluster Ehcache, you need to use Terracotta's own tech.
