23 Replies · Latest reply on Mar 9, 2012 5:00 AM by dan.berindei

    StateTransferException with infinispan 5.0.0FINAL

    shohou

      Hi!

           I'm getting a StateTransferException with 5+ nodes in a cluster spread across 3 different computers:

      This one is from the starting (joining) instance:

       

      2011-09-02 10:11:20,822 - INFO  - GlobalComponentRegistry        - ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.0.FINAL
      2011-09-02 10:11:21,083 - INFO  - RpcManagerImpl                 - ISPN000074: Trying to fetch state from SERVER-1-43204
      2011-09-02 10:11:23,093 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:11:53,122 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
      2011-09-02 10:11:53,241 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:12:23,265 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
      2011-09-02 10:12:23,282 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:12:53,306 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
      2011-09-02 10:12:53,324 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:13:53,392 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
      2011-09-02 10:13:53,406 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:14:23,432 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
      2011-09-02 10:14:23,448 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:14:53,474 - INFO  - InboundInvocationHandlerImpl   - ISPN000068: Cache named api-reports-cache does not exist on this cache manager!
      2011-09-02 10:14:53,486 - INFO  - InboundInvocationHandlerImpl   - ISPN000067: Will try and wait for cache named api-reports-cache to start
      2011-09-02 10:15:21,554 - ERROR - JGroupsTransport               - ISPN000096: Caught while requesting or applying state
      org.infinispan.statetransfer.StateTransferException: java.io.EOFException: The stream ended unexpectedly.  Please check whether the source of the stream encountered any issues generating the stream.
              at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:292)
              at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:235)
              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:607)
              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:711)
              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
              at org.jgroups.JChannel.up(JChannel.java:1441)
              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:523)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:462)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:223)
              at org.jgroups.protocols.FRAG2.up(FRAG2.java:189)
              at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
              at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
              at org.jgroups.protocols.pbcast.GMS.up(GMS.java:908)
              at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246)
              at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:613)
              at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
              at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:703)
              at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133)
              at org.jgroups.protocols.FD.up(FD.java:275)
              at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275)
              at org.jgroups.protocols.MERGE2.up(MERGE2.java:209)
              at org.jgroups.protocols.Discovery.up(Discovery.java:293)
              at org.jgroups.protocols.TP.passMessageUp(TP.java:1109)
              at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1665)
              at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1647)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.EOFException: The stream ended unexpectedly.  Please check whether the source of the stream encountered any issues generating the stream.
              at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:193)
              at org.infinispan.statetransfer.StateTransferManagerImpl.processCommitLog(StateTransferManagerImpl.java:215)
              at org.infinispan.statetransfer.StateTransferManagerImpl.applyTransactionLog(StateTransferManagerImpl.java:237)
              at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:281)
              ... 28 more
      Caused by: java.io.EOFException: Read past end of file
              at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
              at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
              at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:209)
              at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:37)
              at org.infinispan.marshall.jboss.GenericJBossMarshaller.objectFromObjectStream(GenericJBossMarshaller.java:194)
              at org.infinispan.marshall.VersionAwareMarshaller.objectFromObjectStream(VersionAwareMarshaller.java:191)
              ... 31 more
      

       

      And these are from the instance that provides the state:

       

      2011-09-06 12:07:25,725 -- ERROR [::] - RpcManagerImpl - ISPN000073: Unexpected error while replicating
      java.util.concurrent.TimeoutException: Timed out waiting for a cluster-wide sync to be released. (timeout = 240 seconds)
              at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.blockUntilReleased(JGroupsDistSync.java:78)
              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:436)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:132)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:156)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:265)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:252)
              at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:235)
              at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:228)
              at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:116)
              at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:79)
      ...
      
      

       

      org.infinispan.CacheException: java.util.concurrent.TimeoutException: Timed out waiting for a cluster-wide sync to be released. (timeout = 240 seconds)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:145)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:156)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:265)
              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:252)
              at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:235)
              at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:228)
              at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:116)
              at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:79)
              at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
              at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:119)
      
      

       

      2011-09-06 12:18:16,295 -- DEBUG - JGroupsDistSync - Failed to acquire exclusive processing lock. Read lock holders are [Thread[RetryQueueProcessor-sessions-cache@DFS1-27866,10,Thread Pools],
       Thread[RetryQueueProcessor-api-reports-cache@DFS1-27866,10,Thread Pools], Thread[Timer-2,5,main]]
      2011-09-06 12:18:16,297 -- ERROR - JGroupsTransport - ISPN000095: Caught while responding to state transfer request
      org.infinispan.statetransfer.StateTransferException: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-10,Infinispan-Cluster,DFS1-27866 could not obtain exclusive processing lock after 240 seconds.  Locks in question
       are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@19977e8[Read locks = 3] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@8f243d[Unlocked]
              at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:162)
              at org.infinispan.remoting.InboundInvocationHandlerImpl.generateState(InboundInvocationHandlerImpl.java:253)
              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.getState(JGroupsTransport.java:590)
              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:690)
              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
              at org.jgroups.JChannel.up(JChannel.java:1484)
              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderHandler.process(STREAMING_STATE_TRANSFER.java:651)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER$StateProviderThreadSpawner$1.run(STREAMING_STATE_TRANSFER.java:580)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
              at java.lang.Thread.run(Thread.java:619)
      Caused by: java.util.concurrent.TimeoutException: STREAMING_STATE_TRANSFER-sender-10,Infinispan-Cluster,DFS1-27866 could not obtain exclusive processing lock after 240 seconds.  Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@19977e8[Read locks = 3] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@8f243d[Unlocked]
              at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:104)
              at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:189)
              at org.infinispan.statetransfer.StateTransferManagerImpl.generateState(StateTransferManagerImpl.java:152)
              ... 11 more
      2011-09-06 12:18:16,303 -- DEBUG - STREAMING_STATE_TRANSFER - State writer is closing the socket
      
      

       

      There was no load at all; all caches had maybe 5 objects in them.

      Version 4.2.1.FINAL works without any problems with at least 10 nodes in the cluster.
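
      For reference, the caches are configured roughly like this - only a minimal sketch, the cache and cluster names match the logs but the timeout values here are illustrative, not my exact settings:

      <infinispan>
          <global>
              <transport clusterName="Infinispan-Cluster" />
          </global>
          <namedCache name="api-reports-cache">
              <clustering mode="replication">
                  <sync replTimeout="15000" />
                  <!-- fetchInMemoryState makes a joining node request state from an existing member -->
                  <stateRetrieval fetchInMemoryState="true" timeout="240000" />
              </clustering>
          </namedCache>
      </infinispan>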

        • 1. Re: StateTransferException with infinispan 5.0.0FINAL
          galder.zamarreno

          Hmm, we've revamped state transfer for 5.1. There's 5.1.0.ALPHA2 available with the new state transfer, but I'd probably wait for BETA1, which will be released on Monday/Tuesday, since it's been further consolidated.

          • 2. Re: StateTransferException with infinispan 5.0.0FINAL
            shohou

            Nice, thanks. Will try BETA1 when it's out

            • 3. Re: StateTransferException with infinispan 5.0.0FINAL
              galder.zamarreno

              It's out now

              • 4. Re: StateTransferException with infinispan 5.0.0FINAL
                shohou

                Tried with 5.1.0.BETA1. State transfer works, but afterwards I started to get a lot of

                TCP - logical address cache didn't contain all physical address, sending up a discovery request

                on one node, and a lot of

                TCP - S1-60192: no physical address for 204844ea-5a4e-286f-4b75-23dceb10b7a9, dropping message

                RequestCorrelator - suspect=204844ea-5a4e-286f-4b75-23dceb10b7a9

                on another node.

                • 5. Re: StateTransferException with infinispan 5.0.0FINAL
                  galder.zamarreno

                   Dmitry, I don't know what that's about, but it seems to come from JGroups. Any logs available? If you can enable TRACE logging on org.jgroups, it'd be even better.
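
                   With log4j, for example (assuming that's the logging backend in use - the fragment below is just a sketch), it's a one-line logger addition:

                   <!-- log4j.xml fragment: enable TRACE for JGroups -->
                   <logger name="org.jgroups">
                       <level value="TRACE" />
                   </logger>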

                  • 6. Re: StateTransferException with infinispan 5.0.0FINAL
                    shohou

                    It is somehow related to the JGroups configuration. I use TCPPING with two servers, A and B, in initial_hosts. These servers see each other, and I assume that one of them is always running. I have another two servers, C and D, which have the same TCPPING initial_hosts - A and B - so they can find each other only through A or B. With JGroups 2.12.1.Final this configuration worked: C and D received information about each other from A or B and the whole cluster worked. Now C receives a UUID (204844ea-5a4e-286f-4b75-23dceb10b7a9) instead of the correct name of D and complains that it has no physical address for it. It sends discovery requests to A and B but is still not able to find D. The same thing happens on D with respect to C.

                     

                    If I put all the servers in TCPPING.initial_hosts, then everything works fine.
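
                    Roughly, the stack fragment that works looks like this (a sketch - host names and ports are placeholders for the real ones):

                    <TCP bind_port="7800" />
                    <TCPPING timeout="3000"
                             initial_hosts="A[7800],B[7800],C[7800],D[7800]"
                             port_range="1"
                             num_initial_members="4" />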

                     

                    Here are the logs with JGroups TRACE enabled, if you want to take a look - http://dl.dropbox.com/u/16787249/investigations.zip

                    • 7. Re: StateTransferException with infinispan 5.0.0FINAL
                      galder.zamarreno

                      It could be an issue with JGroups 3.0, which is the version used by the latest Infinispan. I'll raise the issue with the JGroups guys.

                      • 8. Re: StateTransferException with infinispan 5.0.0FINAL
                        shohou

                        I'd like to have a link to the issue if it's public.

                        • 9. Re: StateTransferException with infinispan 5.0.0FINAL
                          belaban

                          Put *all* of the cluster servers in TCPPING.initial_hosts, as recommended by the documentation.

                           

                          TCPPING is used for static clusters, in which the cluster membership is fixed and known beforehand. There are other, dynamic discovery protocols, such as TCPGOSSIP, FILE_PING, MPING, etc., to accommodate dynamic cluster membership.
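
                          For example, with FILE_PING every member writes its own address into a shared directory, so nothing has to be listed up front (a sketch - the location is just an example path on a shared filesystem):

                          <TCP bind_port="7800" />
                          <FILE_PING location="/mnt/shared/jgroups" />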

                          • 10. Re: StateTransferException with infinispan 5.0.0FINAL
                            cbo_

                            We have more recently been using 5.0.0.FINAL. It had been doing well for us until last week, when we hit the same issue described in this thread. I can see the suggestion at the time of the discussion (about 2 months ago) was to go to a newer release candidate. I have a few questions/concerns on which I'd like to get feedback from Galder.

                             

                            1)  Do you know what the issue was, i.e. which source file within the state transfer may have caused the exception?

                            2)  As with Dmitry's case, this occurred for us when adding a second embedded cache into the cluster (we only have 2, though; I think he mentioned having 10 in the cluster).

                            3)  It seems that once the exception occurs on the 2nd VM, the Scheduled-ReplicationQueueThread gets messed up on the coordinator. It gets an ISPN000073 error with "Unexpected error while replicating". Is this an indication that the cluster joiner has created the issue (even for the coordinator), or is it the coordinator that caused both?

                            4)  Which release candidate should be reliable for us, assuming we don't want to just take the latest, since that might carry a significant risk of introducing new bugs?

                             

                            I have updated Joel Tosi on this issue as well in case you get feedback from that direction.

                             

                             

                            Thanks,

                            Craig

                            • 11. Re: StateTransferException with infinispan 5.0.0FINAL
                              dan.berindei

                              craig bomba wrote:

                               

                              We have more recently been using 5.0.0.FINAL. It had been doing well for us until last week, when we hit the same issue described in this thread. I can see the suggestion at the time of the discussion (about 2 months ago) was to go to a newer release candidate. I have a few questions/concerns on which I'd like to get feedback from Galder.

                               

                              1)  Do you know what the issue was, i.e. which source file within the state transfer may have caused the exception?

                               

                               

                              The StateTransferManager on the first node tried to acquire a cluster-wide lock (JGroupsDistSync) before writing the commit log at the end of state transfer, and it couldn't acquire it (maybe because of a deadlock with other transactions running at the time on the first node).

                               

                              We haven't investigated it too much because in 5.1.0 we're already using a different state transfer mechanism that doesn't acquire a cluster-wide lock.
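
                              The 240 seconds in those messages comes from the transport's distributedSyncTimeout (240000 ms by default). On 5.0 you could raise it as a stopgap, something like the sketch below, but that only papers over the contention rather than fixing it:

                              <global>
                                  <transport clusterName="Infinispan-Cluster" distributedSyncTimeout="600000" />
                              </global>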

                               

                              2)  As with Dmitry's case, this occurred for us when adding a second embedded cache into the cluster (we only have 2, though; I think he mentioned having 10 in the cluster).

                              3)  It seems that once the exception occurs on the 2nd VM, the Scheduled-ReplicationQueueThread gets messed up on the coordinator. It gets an ISPN000073 error with "Unexpected error while replicating". Is this an indication that the cluster joiner has created the issue (even for the coordinator), or is it the coordinator that caused both?

                               

                              It's most likely the coordinator; the joiner doesn't have any transactions running yet, so it's unlikely to cause a deadlock on the cluster-wide lock.

                               

                              4)  Which release candidate should be reliable for us, assuming we don't want to just take the latest, since that might carry a significant risk of introducing new bugs?

                               

                              5.0.1.FINAL didn't change anything in the replicated-mode state transfer, so it should be the same as 5.0.0.FINAL for you.

                              5.1.0.BETA1 changed the state transfer to use the same mechanism as rehashing in 5.0 (ISPN-1194), and we've been making it more stable ever since. So I would advise going with 5.1.0.CR1.
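
                              If you're pulling Infinispan in with Maven, the upgrade is just a version bump on the core module (and on any other Infinispan modules you use):

                              <dependency>
                                  <groupId>org.infinispan</groupId>
                                  <artifactId>infinispan-core</artifactId>
                                  <version>5.1.0.CR1</version>
                              </dependency>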

                              • 12. Re: StateTransferException with infinispan 5.0.0FINAL
                                cbo_

                                I appreciate the quick reply. 

                                 

                                 In fact, we found out (consistent with your description) that 5.0.1.FINAL also has this issue.

                                 

                                 At this point it would be risky for us to go forward to 5.1.0.CR1. We will fall back to using 4.1.0.FINAL in the short term and move to 5.1 in early 2012.

                                 

                                Thanks again.

                                • 13. Re: StateTransferException with infinispan 5.0.0FINAL
                                  douglas.ferguson

                                   I'm curious if anybody has found a workaround.

                                   

                                  We are experiencing the same issue.

                                   

                                   We tried to update to 5.1, but we get NoClassDefFound errors because we are also using JGroups for EhCache.

                                  • 14. Re: StateTransferException with infinispan 5.0.0FINAL
                                    galder.zamarreno

                                     What are you using EhCache for? Btw, since EhCache got acquired by Terracotta, I don't think there's much work going into their JGroups compatibility layer; instead, if you want to cluster EhCache, you need to use Terracotta's own tech.
