10 Replies Latest reply on Jun 27, 2012 6:21 AM by dan.berindei

    Why is infinispan blocking?

    ittayd

      I'm trying to use infinispan in REPL_SYNC with 2 nodes

       

      When trying to put values, the cache hangs doing a 'lock.wait(timeout)' in the stack trace below. I can't figure out why the lock is not notified. I have enabled JMX and it says there are 2 nodes in the cluster and the connections are as expected.

       

      Regards,

      Ittay

       

        at org.infinispan.statetransfer.StateTransferLockImpl.acquireLockForWriteCommand(StateTransferLockImpl.java:309)
        - locked <0x1d99> (a java.lang.Object)
        at org.infinispan.statetransfer.StateTransferLockImpl.acquireForCommand(StateTransferLockImpl.java:128)
        at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:176)
        at org.infinispan.interceptors.StateTransferLockInterceptor.visitPutKeyValueCommand(StateTransferLockInterceptor.java:152)
        at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
        at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:125)
        at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
        at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
        at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:132)
        at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:91)
        at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:62)
        at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
        at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)
        at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1006)
        at org.infinispan.CacheImpl.put(CacheImpl.java:702)
        at org.infinispan.CacheImpl.put(CacheImpl.java:694)
        at org.infinispan.CacheSupport.put(CacheSupport.java:53)
        • 1. Re: Why is infinispan blocking?
          dan.berindei

          Ittay, even if the JGroups cluster has formed properly, for consistency reasons, we block write commands in the entire cluster until the existing owner has finished transferring any existing state to the joiner. The stack trace you posted suggests that the state transfer is still in progress; assuming you've just started both nodes, this is likely a problem either with your configuration or with Infinispan itself.

           

          If you can reproduce the problem, please enable TRACE logging for org.infinispan and attach it here (or even better, to a new issue in JIRA) along with a full thread dump and the Infinispan/JGroups configurations.

          • 2. Re: Why is infinispan blocking?
            ittayd

            When I look at all threads, I see no other infinispan thread. I assume there should be another thread doing the state transfer. Where is it started?

             

            My configuration is REPL_SYNC, NON_TRANSACTIONAL

            • 3. Re: Why is infinispan blocking?
              ittayd

              I see that the ReplicatedStateTransferTask is performed, but it actually does nothing as 'initialView' is true. After that, it is not performed again.

              • 4. Re: Why is infinispan blocking?
                dan.berindei

                Ittay Dror wrote:

                 

                When I look at all threads, I see no other infinispan thread. I assume there should be another thread doing the state transfer. Where is it started?

                On each node you should have at least a "CacheViewTrigger" thread and a "CacheViewInstaller" thread (if you have more than one cache you can have multiple CacheViewInstaller threads).

                These are high-level though, the actual state transfer happens in "async-transport" and in JGroups' "OOB" threads.

                 

                I see that the ReplicatedStateTransferTask is performed, but it actually does nothing as 'initialView' is true. After that, it is not performed again.

                Yes, the ReplicatedStateTransferTask doesn't do anything on the joiner, but still the joiner has to wait for data from the existing member and for a confirmation that it has finished pushing data. If you have TRACE logs enabled, you should see a "CacheViewControlCommand{...type=COMMIT_VIEW...}" line in the logs. The joiner will unlock the StateTransferLock only after it had received the COMMIT_VIEW command.

                • 5. Re: Why is infinispan blocking?
                  ittayd

                  Dan Berindei wrote:

                   

                  Ittay Dror wrote:

                   

                  When I look at all threads, I see no other infinispan thread. I assume there should be another thread doing the state transfer. Where is it started?

                  On each node you should have at least a "CacheViewTrigger" thread and a "CacheViewInstaller" thread (if you have more than one cache you can have multiple CacheViewInstaller threads).

                   

                  I have neither. Where are they started?

                   

                  These are high-level though, the actual state transfer happens in "async-transport" and in JGroups' "OOB" threads.

                   

                  I see that the ReplicatedStateTransferTask is performed, but it actually does nothing as 'initialView' is true. After that, it is not performed again.

                  Yes, the ReplicatedStateTransferTask doesn't do anything on the joiner, but still the joiner has to wait for data from the existing member and for a confirmation that it has finished pushing data. If you have TRACE logs enabled, you should see a "CacheViewControlCommand{...type=COMMIT_VIEW...}" line in the logs. The joiner will unlock the StateTransferLock only after it had received the COMMIT_VIEW command.

                  Thanks, I'll try to trace that.

                  • 6. Re: Why is infinispan blocking?
                    dan.berindei

                    Ittay Dror wrote:

                     

                    I have neither. Where are they started?

                     

                    The "CacheViewTrigger" is started by CacheViewsManagerImpl when the cache manager starts.

                    The "CacheViewInstaller" threads are started again by CacheViewsManagerImpl, but only on the coordinator, and only when a second node joins the cache view (i.e. starts the same cache).

                    • 7. Re: Why is infinispan blocking?
                      ittayd

                      So I found CacheViewTrigger. I also traed the commitView sequence and it seems to run (but only once, although I have 2 caches)

                       

                      However, when the ReplicatedStateTransferTask runs, it has 0 joiners and so does nothing.

                      • 8. Re: Why is infinispan blocking?
                        ittayd

                        attached the trace log

                        • 9. Re: Why is infinispan blocking?
                          ittayd

                          So I found the reason: in the coordinator node, objects were added to the cache that were not serializable. This caused view replication to fail. So the second node never got to the commit view phase and so writes to the second cache failed.

                           

                          I didn't look at the errors in the first node. So I missed this.

                           

                          I think it would be nice if infinispan would also produce errors in the other nodes when a view cannot be synced due to serialization issues in the first node.

                          • 10. Re: Why is infinispan blocking?
                            dan.berindei

                            I'm glad you fixed it, Ittay, even if it took so long.

                             

                            I'm pretty sure the joiner would eventually give you an exception as well (after stateTransfer.timeout expires). The first node, on the other hand, will keep trying to serialize the state and send it to the joiner, only incrementing the cache view id.

                             

                            It would be pretty hard to monitor for serialization exceptions on the first node (and after all, why only serialization exceptions?). But I think we could improve this by sending the root exception to all the nodes when we "roll back" the state transfer/cache view installation. I assume that would have helped in your case.