
    TopologyAwareConsistentHash / DefaultConsistentHash & LEAVE Rehashing 4.2.1.FINAL

    markaddy

      Hi

       

      Recently I configured 4.2.1.FINAL in DIST mode with the topology parameters for machine, rack and site ID.  Here is the configuration file:

       

      {code:xml}

      <?xml version="1.0" encoding="UTF-8"?>

       

      <infinispan

            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

            xsi:schemaLocation="urn:infinispan:config:4.2 http://www.infinispan.org/schemas/infinispan-config-4.2.xsd"

            xmlns="urn:infinispan:config:4.2">

       

         <global>

            <globalJmxStatistics

                  enabled="true"

                  jmxDomain="org.infinispan"

                  cacheManagerName="MYCacheManager"/>

            <transport

                  clusterName="my-infinispan-cluster"

                  machineId="${infinispan.machine:MACHINE1}"

                  siteId="${infinispan.site:SITE1}"

                  rackId="${infinispan.rack:RACK1}"

                  nodeName="${infinispan.node:NODE1}">

               <properties>

                  <property name="configurationFile" value="my-jgroups-udp.xml" />

               </properties>

            </transport>

         </global>

       

         <default>

            <clustering mode="distribution">

               <l1 enabled="false" />

               <hash numOwners="2"/>

               <async/>

            </clustering>

            <locking

               isolationLevel="READ_COMMITTED"

               lockAcquisitionTimeout="20000"

               writeSkewCheck="false"

               concurrencyLevel="5000"

               useLockStriping="false"

            />

            <jmxStatistics enabled="true"/>

         </default>

       

      </infinispan>

      {code}
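
      For reference, each node resolves the ${infinispan.node:NODE1}-style placeholders above from JVM system properties. A launcher along these lines (only a sketch: the config file name, class name and argument handling are examples, not from my actual test) gives each JVM its own identity, equivalent to passing -Dinfinispan.node=NODE2 etc. on the command line:

      {code:java}
      import org.infinispan.manager.DefaultCacheManager;

      public class NodeLauncher {
         public static void main(String[] args) throws Exception {
            // Unset properties fall back to the defaults declared in the XML
            // (NODE1, MACHINE1, ...); override them per node before startup.
            if (args.length > 0) System.setProperty("infinispan.node", args[0]);
            if (args.length > 1) System.setProperty("infinispan.machine", args[1]);

            DefaultCacheManager manager = new DefaultCacheManager("my-infinispan-config.xml");
            manager.getCache(); // starts the default distributed cache and joins the cluster
         }
      }
      {code}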

       

      The following steps were then performed:

       

      1. Start 4 nodes

      2. Kill 1 node

      3. Attempt to put entries into the cache (sketched below)
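
      Reusing the manager from the launcher sketch above, a loop like the following (key and value names are arbitrary examples) is enough to reproduce step 3 and hit the exception further below on a surviving node:

      {code:java}
      import org.infinispan.Cache;

      // Sketch of step 3: after NODE2 is killed, further puts exercise the
      // remote-get path (put returns the previous value), which fails below.
      Cache<String, String> cache = manager.getCache();
      for (int i = 0; i < 1000; i++) {
         cache.put("key-" + i, "value-" + i);
      }
      {code}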

       

      In this example NODE2 was removed. With TRACE logging enabled, all the remaining nodes report that the leave rehash completed successfully:

       

      infinispan-NODE1.log:2011-06-19 08:18:53,101 INFO  [org.infinispan.distribution.InvertedLeaveTask] (Rehasher-NODE1-27635) Completed leave rehash on node NODE1-27635 in 114 milliseconds - leavers now are []

      infinispan-NODE3.log:2011-06-19 08:18:53,109 INFO  [org.infinispan.distribution.InvertedLeaveTask] (Rehasher-NODE3-28994) Completed leave rehash on node NODE3-28994 in 88 milliseconds - leavers now are []

      infinispan-NODE4.log:2011-06-19 08:18:53,110 INFO  [org.infinispan.distribution.InvertedLeaveTask] (Rehasher-NODE4-57011) Completed leave rehash on node NODE4-57011 in 91 milliseconds - leavers now are []

       

      However, when attempting to place further entries in the cache, the following stacktrace is seen:

       

      2011-06-18 13:47:03,740 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-14,NODE1-41654) Received response: ExceptionResponse from NODE3-20069

      2011-06-18 13:47:03,743 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (TP-Processor2) Responses: [sender=NODE3-20069, retval=ExceptionResponse, received=true, suspected=false]

      2011-06-18 13:47:03,743 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (TP-Processor2) replication exception:

      org.infinispan.CacheException: No such address ( NODE2-47967) in the list of caches: {NODE1-41654=NodeTopologyInfo{machineId='MACHINE1', rackId='RACK1', siteId='SITE1', address=NODE1-41654}, NODE4-36214=NodeTopologyInfo{machineId='MACHINE1', rackId='RACK1', siteId='SITE1', address=NODE4-36214}, NODE3-20069=NodeTopologyInfo{machineId='MACHINE1', rackId='RACK1', siteId='SITE1', address=NODE3-20069}}

              at org.infinispan.distribution.ch.TopologyInfo.isSameSite(TopologyInfo.java:42)

              at org.infinispan.distribution.ch.TopologyAwareConsistentHash.getOwners(TopologyAwareConsistentHash.java:102)

              at org.infinispan.distribution.ch.TopologyAwareConsistentHash.locate(TopologyAwareConsistentHash.java:55)

              at org.infinispan.distribution.DistributionManagerImpl.isAffectedByRehash(DistributionManagerImpl.java:436)

              at org.infinispan.commands.remote.ClusteredGetCommand.perform(ClusteredGetCommand.java:117)

              at org.infinispan.commands.remote.ClusteredGetCommand.perform(ClusteredGetCommand.java:56)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:142)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:156)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithRetry(InboundInvocationHandlerImpl.java:246)

              at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:129)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144)

       

      After stepping through the code, I found that cache.put first initiates a get to return the previous value. If the key is not located locally, a remote get is triggered, but the node receiving the ClusteredGetCommand still perceives that a rehash is in progress: TransactionLoggerImpl.isEnabled() returns true and oldConsistentHash still contains the old view of the cluster from before the LEAVE event. See DistributionManagerImpl.isAffectedByRehash(key):

       

      {code:java}

      @ManagedOperation(description = "Determines whether a given key is affected by an ongoing rehash, if any.")

      @Operation(displayName = "Could key be affected by rehash?")

      public boolean isAffectedByRehash(@Parameter(name = "key", description = "Key to check") Object key) {

         return transactionLogger.isEnabled() && oldConsistentHash != null && !oldConsistentHash.locate(key, getReplCount()).contains(self);

      }

      {code}
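
      Since isAffectedByRehash is a @ManagedOperation, the stuck state can also be confirmed over JMX from outside the application. The following is only a sketch and makes two assumptions: that the operation is exposed under its method name, and that the DistributionManager component lives under the org.infinispan domain configured above (the exact ObjectName layout is version-dependent, hence the wildcard query):

      {code:java}
      import java.lang.management.ManagementFactory;
      import javax.management.MBeanServer;
      import javax.management.ObjectName;

      // Ask every registered DistributionManager whether a sample key is still
      // considered affected by a rehash. After the failed leave this keeps
      // returning true (or, with topology parameters set, throws the
      // CacheException above) because the transaction logger stays enabled.
      public static void probeRehashState(Object key) throws Exception {
         MBeanServer server = ManagementFactory.getPlatformMBeanServer();
         for (ObjectName name : server.queryNames(
               new ObjectName("org.infinispan:component=DistributionManager,*"), null)) {
            Object affected = server.invoke(name, "isAffectedByRehash",
                  new Object[] { key }, new String[] { Object.class.getName() });
            System.out.println(name + " -> " + affected);
         }
      }
      {code}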

        

      I think the problem lies at the end of rehashing, where InvertedLeaveTask attempts to unlock the TransactionLogger but instead throws an NPE because the parameter lockedFor is null:

       

      {code:java}

         public void unlockAndDisable(Address lockedFor) {

            boolean unlock = true;

            try {
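            // NPE is thrown on the next line when lockedFor is null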

               if (!lockedFor.equals(writeLockOwner)) {

                  unlock = false;

                  throw new IllegalMonitorStateException("Compare-and-set for owner " + lockedFor + " failed - was " + writeLockOwner);

               }

       

               enabled = false;

               uncommittedPrepares.clear();

               writeLockOwner = null;

            } catch (IllegalMonitorStateException imse) {

               log.warn("Unable to stop transaction logging!", imse);

            } finally {

               if (unlock) modsLatch.open();

            }

         }

      {code}
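
      For what it is worth, a null guard on lockedFor (only a sketch of a possible workaround, not the actual fix; the deeper question is why InvertedLeaveTask passes null here in the first place) would at least turn the unchecked NPE into the logged IllegalMonitorStateException, so the rest of the leave processing is not aborted:

      {code:java}
      // Hypothetical guard, sketched against the method above: treat a null
      // lockedFor like a failed compare-and-set instead of dereferencing it.
      // Note this alone does not disable transaction logging on the affected
      // node; it only stops the NPE from escaping unlockAndDisable.
      if (lockedFor == null || !lockedFor.equals(writeLockOwner)) {
         unlock = false;
         throw new IllegalMonitorStateException("Compare-and-set for owner " + lockedFor + " failed - was " + writeLockOwner);
      }
      {code}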

       

      The stacktrace is only generated when the topology parameters are set in the Infinispan configuration file. Without topology parameters (when the DefaultConsistentHash is used) the same underlying problem exists: the TransactionLogger remains enabled. Apologies if this is already a known problem, but I was unable to find an exact match in the issue tracker; https://issues.jboss.org/browse/ISPN-1100 may be closely related, though.