23 Replies · Latest reply: Mar 9, 2012 5:00 AM by Dan Berindei

StateTransferException with infinispan 5.0.0FINAL

Dmitry Shohov Newbie

Hi!

     I'm getting a StateTransferException with 5+ nodes in a cluster on 3 different computers:

This one is on the starting instance:

 

2011-09-02 10:11:20,822 - INFO  - GlobalComponentRegistry        - ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.0.FINAL
2011-09-02 10:11:21,083 - INFO  - RpcManagerImpl                 - ISPN000074: Trying to fetch state from SERVER-1-43204
2011-09-02 10:11:23,093 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:11:53,122 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:11:53,241 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:12:23,265 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:12:23,282 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:12:53,306 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:12:53,324 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:13:53,392 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:13:53,406 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:14:23,432 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:14:23,448 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:14:53,474 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
2011-09-02 10:14:53,486 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
2011-09-02 10:15:21,554 - ERROR - JGroupsTransport               - ISPN000096: Caught while requesting or applying state
org.infinispan.statetransfer.StateTransferException: java.io.EOFException: The stream ended unexpectedly.  Please check whether the source of the stream encountered any issues generating the stream.
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:292)
        at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:235)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:607)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:711)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
        at org.jgroups.JChannel.up(JChannel.java:1441)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:523)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:462)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:223)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:189)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:908)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246)
        at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:613)
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:703)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133)
        at org.jgroups.protocols.FD.up(FD.java:275)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275)
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:209)
        at org.jgroups.protocols.Discovery.up(Discovery.java:293)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1109)
        at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1665)
        at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1647)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException: The stream ended unexpectedly.  Please check whether the source of the stream encountered any issues generating the stream.
        at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:193)
        at org.infinispan.statetransfer.StateTransferManagerImpl.processCommitLog(StateTransferManagerImpl.java:215)
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyTransactionLog(StateTransferManagerImpl.java:237)
        at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:281)
        ... 28 more
Caused by: java.io.EOFException: Read past end of file
        at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
        at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
        at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:209)
        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:37)
        at org.infinispan.marshall.jboss.GenericJBossMarshaller.objectFromObjectStream(GenericJBossMarshaller.java:194)
        at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:191)
        ... 31 more

 

And these are from the instance that provides the state:

 

2011-09-06 12:07:25,725 -- ERROR [::] - RpcManagerImpl - ISPN000073: Unexpected error while replicating
java.util.concurrent.TimeoutException: Timed out waiting for a cluster-wide sync to be released. (timeout = 240 seconds)
        at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.blockUntilReleased(JGroupsDistSync.java:78)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:436)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:132)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:156)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:265)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:252)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:235)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:228)
        at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:116)
        at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:79)
...

 

org.infinispan.CacheException: java.util.concurrent.TimeoutException: Timed out waiting for a cluster-wide sync to be released. (timeout = 240 seconds)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:145)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:156)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:265)
        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:252)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:235)
        at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:228)
        at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:116)
        at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:79)
        at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:119)

 

2011-09-06 12:18:16,295 -- DEBUG - JGroupsDistSync - Failed to acquire exclusive processing lock. Read lock holders are [Thread[RetryQueueProcessor-sessions-cache@DFS1-27866,10,Thread Pools], Thread[RetryQueueProcessor-api-reports-cache@DFS1-27866,10,Thread Pools], Thread[Timer-2,5,main]]
2011-09-06 12:18:16,297 -- ERROR - JGroupsTransport - ISPN000095: Caught while responding to state transfer request
org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-10,Infinispan-Cluster,DFS1-27866 could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@19977e8[Read locks = 3] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@8f243d[Unlocked]
        at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:162)
        at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:253)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:590)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:690)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
        at org.jgroups.JChannel.up(JChannel.java:1484)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:651)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:580)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-10,Infinispan-Cluster,DFS1-27866 could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@19977e8[Read locks = 3] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@8f243d[Unlocked]
        at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:104)
        at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:189)
        at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:152)
        ... 11 more
2011-09-06 12:18:16,303 -- DEBUG - STREAMING_STATE_TRANSFER - State writer is closing the socket

 

There was no load at all; all caches had maybe 5 objects in them.

Version 4.2.1.FINAL works without any problems with at least 10 nodes in the cluster.
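
For what it's worth, the exclusive-lock timeout above looks like plain read/write-lock contention: the state-transfer sender needs the JGroupsDistSync write lock while other threads (the RetryQueueProcessor threads in the DEBUG line) still hold read locks. A minimal JDK-only sketch of that pattern, for illustration only (this is not Infinispan code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ExclusiveLockTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Reader threads (like the RetryQueueProcessor threads in the log)
        // hold the read lock while they work.
        lock.readLock().lock();

        // The state-transfer sender needs the write lock exclusively.
        // While any read lock is held (ReentrantReadWriteLock does not
        // support lock upgrade either), a timed tryLock can only time out.
        boolean acquired = lock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println("write lock acquired: " + acquired); // false

        lock.readLock().unlock();

        // Once every reader has released, the write lock is available.
        acquired = lock.writeLock().tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println("write lock acquired: " + acquired); // true
    }
}
```

In our case the readers apparently never released within the 240-second timeout, so the provider gave up and closed the state socket, which is what the receiver then saw as the EOFException.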

  • 1. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    Hmm, we've revamped state transfer for 5.1. There's 5.1.0.ALPHA2 available with the new state transfer, but I'd probably wait for BETA1, which will be released on Monday/Tuesday, because it's been further consolidated.

  • 2. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    Nice, thanks. Will try BETA1 when it's out

  • 4. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    Tried with 5.1.0.BETA1. State transfer works. But after that I started to get a lot of

    TCP - logical address cache didn't contain all physical address, sending up a discovery request

    on one node, and a lot of

    TCP - S1-60192: no physical address for 204844ea-5a4e-286f-4b75-23dceb10b7a9, dropping message

    RequestCorrelator - suspect=204844ea-5a4e-286f-4b75-23dceb10b7a9

    on another node

  • 5. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    Dmitry, I don't know what that's about, but it seems to come from JGroups. Any logs available? If you can enable TRACE logging on org.jgroups, it'd be even better.

  • 6. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    It is somehow related to the JGroups configuration. I use TCPPING with two servers, A and B, in the initial hosts. These servers see each other, and I assume that one of them is always running. I have another two servers, C and D, which have the same TCPPING initial hosts (A and B), so they can find each other only through A or B. With JGroups 2.12.1.Final this configuration worked: C and D received information about each other from A or B, and the whole cluster worked. Now C receives a UUID (204844ea-5a4e-286f-4b75-23dceb10b7a9) instead of the correct name of D and complains that it has no physical address for it. It sends discovery requests to A and B and is still not able to find D. The same thing happens on D with respect to C.

     

    If I put all the servers in TCPPING.initial_hosts, then everything works fine.
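
    For reference, the working variant looks roughly like this in the JGroups stack (host names and ports here are placeholders, not my real config):

```xml
<TCP bind_port="7800" />
<TCPPING timeout="3000"
         initial_hosts="A[7800],B[7800],C[7800],D[7800]"
         port_range="1"
         num_initial_members="4" />
```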

     

    Here are the logs with TRACE on org.jgroups if you want to take a look: http://dl.dropbox.com/u/16787249/investigations.zip

  • 7. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    It could be an issue with JGroups 3.0, which is the version used by the latest Infinispan. I'll raise the issue with the JGroups guys.

  • 8. Re: StateTransferException with infinispan 5.0.0FINAL
    Dmitry Shohov Newbie

    I'd like a link to the issue if it's public.

  • 9. Re: StateTransferException with infinispan 5.0.0FINAL
    Bela Ban Master

    Put *all* of the cluster servers in TCPPING.initial_hosts, as recommended by the documentation.

     

    TCPPING is used for static clusters, in which the cluster membership is fixed and known beforehand. There are other, dynamic discovery protocols, such as TCPGOSSIP, FILE_PING, MPING, etc., to accommodate dynamic cluster membership.
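
    As a sketch, swapping static discovery for FILE_PING over a shared directory would look something like the fragment below (the path is a placeholder):

```xml
<TCP bind_port="7800" />
<!-- Each node writes its own address under `location`; joining nodes
     read the directory instead of needing a fixed initial_hosts list. -->
<FILE_PING location="/shared/jgroups-discovery"
           timeout="3000"
           num_initial_members="2" />
```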

  • 10. Re: StateTransferException with infinispan 5.0.0FINAL
    craig bomba Novice

    We have more recently been using 5.0.0.FINAL. It had been doing well for us until last week, when we hit the same issue described in this thread. I can see the suggestion was to move to a newer release candidate at the time of the discussion (about 2 months ago). I have a few questions/concerns I'd like feedback on from Galder.

     

    1)  Do you know what the issue was (which source file within the state transfer) that may have caused the exception?

    2)  As in Dmitry's case, this occurred for us when adding a second embedded cache into the cluster (we only have 2, though; I think he mentioned having 10 in the cluster).

    3)  It seems that once the exception occurs on the 2nd VM, the Scheduled-ReplicationQueueThread gets messed up on the coordinator. It gets an ISPN000073 error with "Unexpected error while replicating". Is this an indication that the cluster joiner has created the issue (even for the coordinator), or is it the coordinator that caused both?

    4)  Which release candidate should be reliable for us, assuming we don't want to just take the latest, since that might carry a significant risk of introducing new bugs?

     

    I have updated Joel Tosi on this issue as well, in case you get feedback from that direction.

     

     

    Thanks,

    Craig

  • 11. Re: StateTransferException with infinispan 5.0.0FINAL
    Dan Berindei Apprentice

    craig bomba wrote:

     

    We have more recently been using 5.0.0.FINAL. It had been doing well for us until last week, when we hit the same issue described in this thread. I can see the suggestion was to move to a newer release candidate at the time of the discussion (about 2 months ago). I have a few questions/concerns I'd like feedback on from Galder.

     

    1)  Do you know what the issue was (which source file within the state transfer) that may have caused the exception?

     

     

    The StateTransferManager on the first node tried to acquire a cluster-wide lock (JGroupsDistSync) before writing the commit log at the end of state transfer, and it couldn't acquire it (maybe because of a deadlock with other transactions running at the time on the first node).

     

    We haven't investigated it too much because in 5.1.0 we're already using a different state transfer mechanism that doesn't acquire a cluster-wide lock.

     

    2)  As in Dmitry's case, this occurred for us when adding a second embedded cache into the cluster (we only have 2, though; I think he mentioned having 10 in the cluster).

    3)  It seems that once the exception occurs on the 2nd VM, the Scheduled-ReplicationQueueThread gets messed up on the coordinator. It gets an ISPN000073 error with "Unexpected error while replicating". Is this an indication that the cluster joiner has created the issue (even for the coordinator), or is it the coordinator that caused both?

     

    It's most likely the coordinator; the joiner doesn't have any transactions running yet, so it's unlikely to cause a deadlock on the cluster-wide lock.

     

    4)  Which release candidate should be reliable for us assuming we don't want to just take the the latest since that might cause a significant risk if introducing new bugs?

     

    5.0.1.FINAL didn't change anything in the replicated-mode state transfer, so it should be the same as 5.0.0.FINAL for you.

    5.1.0.BETA1 changed the state transfer to use the same mechanism as rehashing in 5.0 (ISPN-1194), and we've been making it more stable ever since. So I would advise going with 5.1.0.CR1.

  • 12. Re: StateTransferException with infinispan 5.0.0FINAL
    craig bomba Novice

    I appreciate the quick reply. 

     

    In fact we found out (consistent with your description) that 5.0.1.FINAL also has this issue.

     

    At this point it would be risky for us to go forward to 5.1.0.CR1. We will fall back to using 4.1.0.FINAL in the short term and move to 5.1 in early 2012.

     

    Thanks again.

  • 13. Re: StateTransferException with infinispan 5.0.0FINAL
    Douglas Ferguson Newbie

    I'm curious if anybody has found a workaround.

     

    We are experiencing the same issue.

     

    We tried to update to 5.1, but we get NoClassDefFound errors because we are also using JGroups for Ehcache.

  • 14. Re: StateTransferException with infinispan 5.0.0FINAL
    Galder Zamarreño Master

    What are you using Ehcache for? Btw, since Ehcache got acquired by Terracotta, I don't think there's much work going into their JGroups compatibility layer; instead, if you want to cluster Ehcache, you need to use Terracotta's own tech.
