
    TopologyAwareConsistentHash / DefaultConsistentHash & LEAVE Rehashing 4.2.1.FINAL

    markaddy

      Hi

       

      Recently I configured 4.2.1.FINAL in DIST mode with the topology parameters for machine, rack and site ID.  Here is the configuration file:

       

      {code:xml}

      <?xml version="1.0" encoding="UTF-8"?>

       

      <infinispan

            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

            xsi:schemaLocation="urn:infinispan:config:4.2 http://www.infinispan.org/schemas/infinispan-config-4.2.xsd"

            xmlns="urn:infinispan:config:4.2">

       

         <global>

            <globalJmxStatistics

                  enabled="true"

                  jmxDomain="org.infinispan"

                  cacheManagerName="MYCacheManager"/>

            <transport

                  clusterName="my-infinispan-cluster"

                  machineId="${infinispan.machine:MACHINE1}"

                  siteId="${infinispan.site:SITE1}"

                  rackId="${infinispan.rack:RACK1}"

                  nodeName="${infinispan.node:NODE1}">

               <properties>

                  <property name="configurationFile" value="my-jgroups-udp.xml" />

               </properties>

            </transport>

         </global>

       

         <default>

            <clustering mode="distribution">

               <l1 enabled="false" />

               <hash numOwners="2"/>

               <async/>

            </clustering>

            <locking

               isolationLevel="READ_COMMITTED"

               lockAcquisitionTimeout="20000"

               writeSkewCheck="false"

               concurrencyLevel="5000"

               useLockStriping="false"

            />

            <jmxStatistics enabled="true"/>

         </default>

       

      </infinispan>

      {code}
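
      For reference, each node resolves the ${infinispan.node:NODE1}-style placeholders above from JVM system properties. A launcher along these lines (only a sketch: the config file name, class name and argument handling are examples, not from my actual test) gives each JVM its own identity, equivalent to passing -Dinfinispan.node=NODE2 etc. on the command line:

      {code:java}
      import org.infinispan.manager.DefaultCacheManager;

      public class NodeLauncher {
         public static void main(String[] args) throws Exception {
            // Unset properties fall back to the defaults declared in the XML
            // (NODE1, MACHINE1, ...); override them per node before startup.
            if (args.length > 0) System.setProperty("infinispan.node", args[0]);
            if (args.length > 1) System.setProperty("infinispan.machine", args[1]);

            DefaultCacheManager manager = new DefaultCacheManager("my-infinispan-config.xml");
            manager.getCache(); // starts the default distributed cache and joins the cluster
         }
      }
      {code}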

       

      The following steps were then performed:

       

      1. Start 4 nodes

      2. Kill 1 node

      3. Attempt to put entries into the cache (sketched below)
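
      Reusing the manager from the launcher sketch above, a loop like the following (key and value names are arbitrary examples) is enough to reproduce step 3 and hit the exception further below on a surviving node:

      {code:java}
      import org.infinispan.Cache;

      // Sketch of step 3: after NODE2 is killed, further puts exercise the
      // remote-get path (put returns the previous value), which fails below.
      Cache<String, String> cache = manager.getCache();
      for (int i = 0; i < 1000; i++) {
         cache.put("key-" + i, "value-" + i);
      }
      {code}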

       

      In this example NODE2 was removed. With TRACE logging enabled, all the remaining nodes report that the leave rehash completed successfully:

       

      infinispan-NODE1.log:2011-06-19 08:18:53,101 INFO  [org.infinispan.distribution.InvertedLeaveTask] (Rehasher-NODE1-27635) Completed leave rehash on node NODE1-27635 in 114 milliseconds - leavers now are []

      infinispan-NODE3.log:2011-06-19 08:18:53,109 INFO  [org.infinispan.distribution.InvertedLeaveTask] (Rehasher-NODE3-28994) Completed leave rehash on node NODE3-28994 in 88 milliseconds - leavers now are []

      infinispan-NODE4.log:2011-06-19 08:18:53,110 INFO  [org.infinispan.distribution.InvertedLeaveTask] (Rehasher-NODE4-57011) Completed leave rehash on node NODE4-57011 in 91 milliseconds - leavers now are []

       

      However, when attempting to place further entries in the cache, the following stacktrace is seen:

       

      2011-06-18 13:47:03,740 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-14,NODE1-41654) Received response: ExceptionResponse from NODE3-20069

      2011-06-18 13:47:03,743 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (TP-Processor2) Responses: [sender=NODE3-20069, retval=ExceptionResponse, received=true, suspected=false]

      2011-06-18 13:47:03,743 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (TP-Processor2) replication exception:

      org.infinispan.CacheException: No such address ( NODE2-47967) in the list of caches: {NODE1-41654=NodeTopologyInfo{machineId='MACHINE1', rackId='RACK1', siteId='SITE1', address=NODE1-41654}, NODE4-36214=NodeTopologyInfo{machineId='MACHINE1', rackId='RACK1', siteId='SITE1', address=NODE4-36214}, NODE3-20069=NodeTopologyInfo{machineId='MACHINE1', rackId='RACK1', siteId='SITE1', address=NODE3-20069}}

              at org.infinispan.distribution.ch.TopologyInfo.isSameSite(TopologyInfo.java:42)

              at org.infinispan.distribution.ch.TopologyAwareConsistentHash.getOwners(TopologyAwareConsistentHash.java:102)

              at org.infinispan.distribution.ch.TopologyAwareConsistentHash.locate(TopologyAwareConsistentHash.java:55)

              at org.infinispan.distribution.DistributionManagerImpl.isAffectedByRehash(DistributionManagerImpl.java:436)

              at org.infinispan.commands.remote.ClusteredGetCommand.perform(ClusteredGetCommand.java:117)

              at org.infinispan.commands.remote.ClusteredGetCommand.perform(ClusteredGetCommand.java:56)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:142)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:156)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithRetry(InboundInvocationHandlerImpl.java:246)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:129)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144)

       

      After stepping through the code, I found that cache.put first initiates a get to return the previous value. If the key is not located locally, a remote get is triggered, but the node receiving the ClusteredGetCommand still perceives that a rehash is in progress: TransactionLoggerImpl.isEnabled() returns true and oldConsistentHash still contains the old view of the cluster from before the LEAVE event. See DistributionManagerImpl.isAffectedByRehash(key):

       

      {code:java}

      @ManagedOperation(description = "Determines whether a given key is affected by an ongoing rehash, if any.")

      @Operation(displayName = "Could key be affected by rehash?")

      public boolean isAffectedByRehash(@Parameter(name = "key", description = "Key to check") Object key) {

         return transactionLogger.isEnabled() && oldConsistentHash != null && !oldConsistentHash.locate(key, getReplCount()).contains(self);

      }

      {code}
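
      Since isAffectedByRehash is a @ManagedOperation, the stuck state can also be confirmed over JMX from outside the application. The following is only a sketch and makes two assumptions: that the operation is exposed under its method name, and that the DistributionManager component lives under the org.infinispan domain configured above (the exact ObjectName layout is version-dependent, hence the wildcard query):

      {code:java}
      import java.lang.management.ManagementFactory;
      import javax.management.MBeanServer;
      import javax.management.ObjectName;

      // Ask every registered DistributionManager whether a sample key is still
      // considered affected by a rehash. After the failed leave this keeps
      // returning true (or, with topology parameters set, throws the
      // CacheException above) because the transaction logger stays enabled.
      public static void probeRehashState(Object key) throws Exception {
         MBeanServer server = ManagementFactory.getPlatformMBeanServer();
         for (ObjectName name : server.queryNames(
               new ObjectName("org.infinispan:component=DistributionManager,*"), null)) {
            Object affected = server.invoke(name, "isAffectedByRehash",
                  new Object[] { key }, new String[] { Object.class.getName() });
            System.out.println(name + " -> " + affected);
         }
      }
      {code}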

        

      I think the problem lies at the end of rehashing, where InvertedLeaveTask attempts to unlock the TransactionLogger but instead throws an NPE because the parameter lockedFor is null:

       

      {code:java}

         public void unlockAndDisable(Address lockedFor) {

            boolean unlock = true;

            try {
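            // NPE is thrown on the next line when lockedFor is null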

               if (!lockedFor.equals(writeLockOwner)) {

                  unlock = false;

                  throw new IllegalMonitorStateException("Compare-and-set for owner " + lockedFor + " failed - was " + writeLockOwner);

               }

       

               enabled = false;

               uncommittedPrepares.clear();

               writeLockOwner = null;

            } catch (IllegalMonitorStateException imse) {

               log.warn("Unable to stop transaction logging!", imse);

            } finally {

               if (unlock) modsLatch.open();

            }

         }

      {code}
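
      For what it is worth, a null guard on lockedFor (only a sketch of a possible workaround, not the actual fix; the deeper question is why InvertedLeaveTask passes null here in the first place) would at least turn the unchecked NPE into the logged IllegalMonitorStateException, so the rest of the leave processing is not aborted:

      {code:java}
      // Hypothetical guard, sketched against the method above: treat a null
      // lockedFor like a failed compare-and-set instead of dereferencing it.
      // Note this alone does not disable transaction logging on the affected
      // node; it only stops the NPE from escaping unlockAndDisable.
      if (lockedFor == null || !lockedFor.equals(writeLockOwner)) {
         unlock = false;
         throw new IllegalMonitorStateException("Compare-and-set for owner " + lockedFor + " failed - was " + writeLockOwner);
      }
      {code}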

       

      The stacktrace is only generated when the topology parameters are set in the Infinispan configuration file. Without topology parameters (when the DefaultConsistentHash is used) the same underlying problem exists: the TransactionLogger remains enabled. Apologies if this is already a known problem, but I was unable to find an exact match in the issue tracker; https://issues.jboss.org/browse/ISPN-1100 may be closely related, though.