7 Replies Latest reply on Aug 12, 2011 1:04 PM by joe.planisky

    Lock/Replication timeouts on 3 node EC2

    joe.planisky

      I'm trying to run a replicated cache on 3 nodes (and eventually more) using Infinispan 5.0.0.FINAL in the Amazon EC2 cloud and I'm running into intermittent TimeoutExceptions.  Most of the time, I get an "Unable to acquire lock after [10 seconds]...", but sometimes it's a "Replication timeout..." (see attached log file excerpt for details.)

       

      I've narrowed things down to a simple demo program (see attached file), the essence of which is this:

          EmbeddedCacheManager mgr = new DefaultCacheManager("testconfig.xml");
          Cache<String, String> cache = mgr.getCache("TestCache");
          try {
              cache.put("c", "start");
          } catch (Exception x ) {}
          while (true) {
              System.out.println("*************");
              System.out.println("Before update: " + cache.get("c"));
              String d = myIp +  " " + new Date().toString();
              try {
                  cache.put("c", d);
              } catch (Exception x) {}
              System.out.println(" After update: " + cache.get("c"));
              System.out.println("*************");
              Thread.sleep(1000);
          }
      

       

      I start this program on the 1st node and wait until it's up and running.  Then I start the 2nd node.  When the 2nd node is starting, both the 1st and 2nd nodes seem to freeze for about 15 seconds, but eventually they resume running and I see the expected console outputs on both.  When I start the 3rd node, again I see all nodes freeze for about 15 seconds, but usually everything recovers and I see the expected outputs on all 3 nodes.

       

      However, after a variable amount of time (a few seconds to a minute or more), I will see the nodes freeze again and after about 10 seconds I'll see the TimeoutExceptions on 2 of the nodes and the 3rd one will just continue where it paused. 

       

      In my JGroups configuration, I'm using TCP transport and S3_PING membership discovery.  (I've also used FILE_PING discovery with the same results, so I don't think it's an S3 issue.)

       

      The significant portion of my Infinispan configuration is:

      <namedCache name="TestCache">
        <deadlockDetection enabled="true"/>
        <unsafe unreliableReturnValues="false" />
        <locking concurrencyLevel="1000" useLockStriping="false" lockAcquisitionTimeout="10000" />
        <clustering mode="replication">
          <sync />
          <stateRetrieval fetchInMemoryState="true"/>
        </clustering>
      </namedCache>
      

       

       

      I've attached my complete Infinispan and JGroups configuration files.

       

      I'm using:

      • Infinispan 5.0.0.FINAL
      • Ubuntu 10.04 (kernel 2.6.32-308-ec2)
      • Java 1.6.0_20

       

      Do I have a configuration problem?  Am I not using Infinispan correctly? Any hints on how to fix or work around this issue?

       

      --

      Joe

        • 1. Re: Lock/Replication timeouts on 3 node EC2
          raulraja

          Joe, not sure if you have already seen this.

          http://community.jboss.org/message/613152

          I've been struggling to get Infinispan working properly on EC2 for weeks now. The current issue I have is the way in which the instances get initialized. If 1 or more instances are initialized in parallel, the cluster does not join properly and state transfers fail. I'm trying to get this stable on AWS Elastic Beanstalk, which autoscales instances up and down. I'm also seeing timeouts randomly popping up in my log, even with trivial amounts of data, when all nodes have joined the cluster. I'm not sure if this is related to EC2 networking issues.

          Also I'm having issues with EOF exceptions with the Lucene integration when using a DB as cache loader/storage.

          • 2. Re: Lock/Replication timeouts on 3 node EC2
            galder.zamarreno

            Joe, you seem to be trying to modify the same key from different nodes at the same time in a setup where you use synchronous replication. This means that, in the current setup, each node tries to apply the state and requests the other nodes to do so. The timeouts are likely due to concurrent attempts from different nodes to acquire the lock on the same key, i.e. Node A tries to acquire the lock on "c" on Node B and vice versa. For 5.1 we're currently working on new locking techniques to avoid this type of issue, for example coordinating the lock acquisition from one of the nodes in the cluster rather than letting all nodes acquire locks, which can lead to deadlocks.
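The crossed lock acquisition described above can be illustrated without Infinispan at all. The following is a minimal simulation using plain `java.util.concurrent` locks, not Infinispan code: the class name `CrossedLockDemo`, the two locks standing in for the per-node locks on key "c", and the short timeout are all assumptions made for illustration. Each simulated node locks the key locally, then asks the other node for its lock and gives up after the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class CrossedLockDemo {

    /** Runs two simulated nodes and returns how many of them timed out. */
    public static int run(long timeoutMillis) throws InterruptedException {
        // One lock per simulated node; both guard the same key "c".
        ReentrantLock lockOnA = new ReentrantLock();
        ReentrantLock lockOnB = new ReentrantLock();
        CountDownLatch bothHoldLocal = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger timeouts = new AtomicInteger();

        Thread a = new Thread(() ->
                update(lockOnA, lockOnB, bothHoldLocal, bothTried, timeouts, timeoutMillis));
        Thread b = new Thread(() ->
                update(lockOnB, lockOnA, bothHoldLocal, bothTried, timeouts, timeoutMillis));
        a.start();
        b.start();
        a.join();
        b.join();
        return timeouts.get();
    }

    private static void update(ReentrantLock local, ReentrantLock remote,
                               CountDownLatch bothHoldLocal, CountDownLatch bothTried,
                               AtomicInteger timeouts, long timeoutMillis) {
        local.lock(); // step 1: lock the key on this node
        try {
            bothHoldLocal.countDown();
            bothHoldLocal.await(); // make sure the two updates really overlap

            // Step 2: ask the other node for its lock, giving up after the timeout.
            if (remote.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                remote.unlock();
            } else {
                timeouts.incrementAndGet();
            }

            // Hold the local lock until both attempts have finished, so the
            // outcome of this demo does not depend on thread timing.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            local.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Prints "Nodes that timed out: 2"
        System.out.println("Nodes that timed out: " + run(200));
    }
}
```

Neither simulated node can make progress because each holds the lock the other one needs, which is the same shape as the "Unable to acquire lock" timeouts reported in the original post.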

            • 3. Re: Lock/Replication timeouts on 3 node EC2
              mircea.markus

              I subscribe to galder's opinion. This will no longer be the case with 5.1, when we'll have a single lock owner in place (ISPN-1137). The single lock owner will apply to distributed caches, but I think I can extend it to replicated caches as well.
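The idea of a single lock owner can be sketched in plain Java as well. This is only a simulation of the concept and makes no claim about how ISPN-1137 is implemented: the class name `SingleLockOwnerDemo`, the `ownerLock` standing in for the one node that owns the lock for key "c", and the map standing in for the replicated state are all hypothetical. Because every simulated node goes through the same lock, concurrent updates queue up instead of blocking each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class SingleLockOwnerDemo {

    /** Starts the given number of simulated nodes and returns how many updates succeeded. */
    public static int run(int nodes, long timeoutMillis) throws InterruptedException {
        // Hypothetical stand-in for the single node that owns the lock for key "c".
        ReentrantLock ownerLock = new ReentrantLock();
        // Hypothetical stand-in for the replicated cache contents.
        Map<String, String> replicated = new ConcurrentHashMap<>();
        AtomicInteger successes = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < nodes; i++) {
            String name = "node-" + i;
            Thread t = new Thread(() -> {
                try {
                    start.await(); // release all nodes at once to force contention
                    // Every node asks the same owner for the lock.
                    if (ownerLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                        try {
                            replicated.put("c", name);
                            successes.incrementAndGet();
                        } finally {
                            ownerLock.unlock();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }

        start.countDown();
        for (Thread t : threads) {
            t.join();
        }
        return successes.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Successful updates: " + run(3, 10_000));
    }
}
```

With one lock to contend for, a node either gets the lock or waits its turn; there is no second lock for another node to hold in the meantime, so the crossed-wait situation cannot arise.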

              • 4. Re: Lock/Replication timeouts on 3 node EC2
                joe.planisky

                Hi Raul,

                 

                Thanks for the link to that thread about issues with EC2.  I had not seen it. 

                 

                However, since I posted my question, we have verified the same timeout behavior on a cluster of local machines using TCPPING.  That suggests this isn't necessarily an EC2 problem.

                 

                --

                Joe

                • 5. Re: Lock/Replication timeouts on 3 node EC2
                  joe.planisky

                  Thank you for your reply, Galder.

                   

                  I am indeed trying to update the same key from 2 different nodes at the same time.  In my real application, this is a distinct possibility.  Although not as likely as in my contrived demo program, it is happening often enough to be a real concern.

                   

                  That said (and with me being completely unfamiliar with how Infinispan works under the covers), I'm surprised that this can lead to a deadlock situation.  My naive expectation would be that two nodes would request a lock on the key, one node would get the lock, do its update, then release the lock so the other node could get the lock and do its work.  Is it an interaction between the simultaneous update attempts and synchronous replication that is causing the deadlock?

                   

                  Is there any way to avoid this deadlock short of guaranteeing that multiple nodes won't try to update the same key at the same time? 

                   

                  --

                  Joe

                  • 6. Re: Lock/Replication timeouts on 3 node EC2
                    galder.zamarreno

                    Yeah, it's the simultaneous requests for the lock crossing the network that can cause the issue. To avoid this, I'd suggest using eager locking, which forces locks to be acquired cluster-wide before the operation is carried out (see https://docs.jboss.org/author/display/ISPN/Locking+and+Concurrency#LockingandConcurrency-Explicitandimplicitdistributedeagerlocking). If you're gonna have values that are gonna be updated from several nodes at the same time, it might be a good idea to use eager locking for them.

                    • 7. Re: Lock/Replication timeouts on 3 node EC2
                      joe.planisky

                      Thanks, Galder.  I'll give that a try.

                       

                      --

                      Joe