1 2 3 … Previous Next 72 Replies Latest reply on Aug 15, 2008 11:28 AM by prabhat.jha

portal clustering problem when using optimistic locking

prabhat.jha Jun 9, 2008 2:02 PM

The problem is described at JBPORTAL-1879 and I have narrowed down it to configuration where this problem can be easily replicated.

It only happens when optimistic locking used with REPL_SYNC as cache_mode.

<attribute name="NodeLockingScheme">OPTIMISTIC</attribute>
<attribute name="IsolationLevel">REPEATABLE_READ</attribute>
<attribute name="CacheMode">REPL_SYNC</attribute>

Changing CacheMode to REPL_ASYNC or LOCAL as well as using Pessimistic locking does not result into this problem. I think we need to look into following possibilities besides looking into portal code base

1. Problem with org.hibernate.cache.OptimisticTreeCache or the way we are using it
2. Deadlock in org.jboss.cache.lock.ReadWriteLockWithUpgrade

1. Re: portal clustering problem when using optimistic locking

galder.zamarreno Jun 16, 2008 11:31 AM (in response to prabhat.jha)

Prabhat,

I'll have a look at this in the next couple of days but it's worth noting that with SYNC communication, specially replication, there's always a chance of a deadlock because locks are only acquired at commit time. So, individual nodes could successfully acquire nodes locally, and then cross each other when trying to acquire locks cluster wide. The end result is both nodes timing out in their operation:

Example:
Tx1 acquires WL on N1 for node /a/b
Tx2 acquires WL on N2 for node /a/b
Tx1 tries to commit and attempts to acquire lock for /a/b while Tx2 is trying to commit and attempt to lock that same node cluster wide.
Tx1 rollbacks because it can't acquire locks for /a/b in N2 and viceversa

As I said, I'll check the attached files in the next couple of days to confirm whether this is precisely what you're seeing. However, it's worth keeping it in mind.
Actions
2. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 16, 2008 11:33 AM (in response to prabhat.jha)

Thanks Galder. I will wait for your findings.
Actions
3. Re: portal clustering problem when using optimistic locking

galder.zamarreno Jun 17, 2008 7:44 AM (in response to prabhat.jha)
Prabhat,

It's true that pessimistic locking for 2nd level cache is currently our preferred configuration, hence it's the default for EJB3 2nd level cache in EAP/AS 4.2+ but I dunno the exact details on why PL was chosen over OL. I'm gonna forward the forum link to Brian who I think might be able to give us further information on why PL was chosen over OL.

In the mean time, I don't have enough info to determine the root cause of your issues and so I'd like you to follow this instructions, repeat the test and attach a zip file with the resulting files to the JIRA.

- Add thread logging information and enable TRACE for org.jboss.cache package. See [1]

- Generate thread dumps in both nodes when you're unable to access that page.

Once you've done this, could you try with the latest EAP 4.2.x CP which is EAP 4.2.0.GA_CP03? This is our latest release for customers which has newer bug fix releases for JBoss Cache and Hibernate amongst other things.

[1] Add to conf/jboss-log4j.xml: <category name="org.jboss.cache"> <priority value="TRACE" /> </category> Change ConversionPattern for server.log file appender to full pattern: <param name="ConversionPattern" value="%d %-5r %-5p [%c] (%t:%x) %m%n"/>
Actions
4. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 17, 2008 10:34 AM (in response to prabhat.jha)

Even in Portal, pessimistic locking was default configuration. Only after many testing, I found and confirmed that OL performs better under heavy load. PL is better when load is lighter.

You will soon have the logs that you need.
Actions
5. Re: portal clustering problem when using optimistic locking

brian.stansberry Jun 17, 2008 10:35 AM (in response to prabhat.jha)

PL was chosen over OL as the default as it has been around a lot longer, has a lot more usage in the wild, and is regarded as more stable/battle tested. Over the course of 2007 we identified and resolved a number of issues with OL, the latest of which is http://jira.jboss.org/jira/browse/JBCACHE-1228 which I suspect may be the source of the problem you were seeing on JBPORTAL-1879.
Actions
6. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 17, 2008 10:43 AM (in response to prabhat.jha)

Thanks Brian. I will upgrade the jboss cache library and run tests again.
Actions
7. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 18, 2008 8:36 PM (in response to prabhat.jha)

With jboss cache version 1.4.1.SP8 (http://repository.jboss.com/jboss/cache/1.4.1.SP8.patch02-brew), I do not have problem accessing a portal page. Hence this certainly fixes this problem. However, I see some errors with 2 and 3 nodes:

2008-06-18 19:16:00,439 ERROR [org.jboss.cache.TreeCache] Caught TimeoutException acquiring locks on region /
org.jboss.cache.lock.TimeoutException: failure acquiring lock: fqn=/_BUDDY_BACKUP_/10.18.1.131_34704/org/jboss/portal/core/impl/model/portal/PortalImpl/windowStates/org.jboss.portal.core.impl.model.portal.PortalImpl.windowStates#3112, caller=Thread[IncomingPacketHandler (channel=portal.hibernate),5,JGroups threads], lock=write owner=GlobalTransaction:<10.18.1.131:34704>:1759 (activeReaders=0, activeWriter=Thread[IncomingPacketHandler (channel=portal.hibernate),5,], waitingReaders=0, waitingWriters=0, waitingUpgrader=0)
at org.jboss.cache.Node.acquire(Node.java:500)
at org.jboss.cache.Node.acquireAll(Node.java:546)
at org.jboss.cache.Node.acquireAll(Node.java:555)
at org.jboss.cache.Node.acquireAll(Node.java:555)
at org.jboss.cache.Node.acquireAll(Node.java:555)
and

2008-06-18 14:49:39,603 ERROR [org.jgroups.protocols.UDP] failed unmarshalling message
java.lang.NullPointerException
at org.jgroups.protocols.FC.handleCredit(FC.java:591)
at org.jgroups.protocols.FC.up(FC.java:398)
at org.jgroups.stack.Protocol.receiveUpEvent(Protocol.java:470)
at org.jgroups.stack.Protocol.passUp(Protocol.java:520)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:768)
at org.jgroups.stack.Protocol.receiveUpEvent(Protocol.java:470)

Do I need to update jgroups library for this cache version? With 4-node cluster, I am getting very bad performance. I will look into that.
Actions
8. Re: portal clustering problem when using optimistic locking

brian.stansberry Jun 18, 2008 9:46 PM (in response to prabhat.jha)

What is being cached here? An entity? You can't use buddy replication for entity caching.

What JGroups version are you using? I'd recommend 2.4.2, which is what is in the latest EAP release (which is also what I'd recommend).
Actions
9. Re: portal clustering problem when using optimistic locking

galder.zamarreno Jun 19, 2008 3:20 AM (in response to prabhat.jha)

"bstansberry@jboss.com" wrote:
What JGroups version are you using? I'd recommend 2.4.2, which is what is in the latest EAP release (which is also what I'd recommend).

Indeed. The latest EAP CP releases, whether 4.2.0.GA_CP03 or 4.3.0.GA_CP01 contain up to date JBC and JGRP releases, see:

http://anonsvn.jboss.org/repos/jbossas/tags/JBPAPP_4_2_0_GA_CP03/build/build-thirdparty.xml
Actions
10. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 20, 2008 4:54 PM (in response to prabhat.jha)

This is getting interesting. I tried with JBPAPP_4_3_0_GA_CP01 in production configuration, I am seeing the same behavior as I was seeing before. If I access a node before other comes up, everything goes fine.

I have removed the buddy replication attribute from tree cache configuration as you mentioned. The complete tree cache configuration is attached to http://jira.jboss.com/jira/browse/JBPORTAL-1879.

The error I see on server side is:

No ManagedConnections available within configured blocking timeout ( 30000 [ms] ); - nested throwable: (javax.resource.ResourceException: No ManagedConnections available within configured blocking timeout ( 30000 [ms] ))

I will take thread dump on both nodes and share my findings.
Actions
11. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 20, 2008 6:05 PM (in response to prabhat.jha)

Thread dumpts attached to JBPORTAL-1879. With 5 users hitting one server, I primarily see

- waiting on <0x0000002b2dad5118> (a java.util.HashMap)
at org.jgroups.blocks.GroupRequest.doExecute(GroupRequest.java:502)
- locked <0x0000002b2dad5118> (a java.util.HashMap)
at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:193)
at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:432)
at org.jgroups.blocks.RpcDispatcher.callRemoteMethods(RpcDispatcher.java:192)
at org.jgroups.blocks.RpcDispatcher.callRemoteMethods(RpcDispatcher.java:163)
at org.jboss.cache.TreeCache.callRemoteMethodsViaReflection(TreeCache.java:4471)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4418)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4371)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4489)
at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
at org.jboss.cache.interceptors.OptimisticReplicationInterceptor.broadcastPrepare(OptimisticReplicationInterceptor.java:181)
at org.jboss.cache.interceptors.OptimisticReplicationInterceptor.invoke(OptimisticReplicationInterceptor.java:90)
at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
at org.jboss.cache.interceptors.TxInterceptor.runPreparePhase(TxInterceptor.java:887)
at org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1170)
at org.jboss.cache.interceptors.OrderedSynchronizationHandler.beforeCompletion(OrderedSynchronizationHandler.java:75)
at com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple.beforeCompletion(SynchronizationImple.java:114)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.beforeCompletion(TwoPhaseCoordinator.java:247)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:86)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:177)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1382)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:135)
at com.arjuna.ats.jbossatx.BaseTransactionManagerDelegate.commit(BaseTransactionManagerDelegate.java:87)
at org.jboss.cache.interceptors.TxInterceptor.handleNonTxMethod(TxInterceptor.java:383)
at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:174)
at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:157)
at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5919)
at org.jboss.cache.TreeCache.put(TreeCache.java:3858)
at org.jboss.cache.TreeCache.put(TreeCache.java:3532)
at org.hibernate.cache.OptimisticTreeCache.writeLoad(OptimisticTreeCache.java:84)
at org.hibernate.cache.TransactionalCache.put(TransactionalCache.java:55)
Actions
12. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 20, 2008 6:06 PM (in response to prabhat.jha)

The way servers are started is:
java -server -Xms3072m -Xmx3072m -XX:MaxPermSize=128m -XX:+UseLargePages -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djgroups.udp.ip_ttl=1 -jar /tmp/pjha/eap43/bin/run.jar -c production -b 10.18.1.129 -Djboss.partition.name=PortalScalabilityTestPartition -Dcluster.route.name=perf11 -u 239.11.12.13
Actions

13. Re: portal clustering problem when using optimistic locking

brian.stansberry Jun 20, 2008 6:51 PM (in response to prabhat.jha)

You're seeing those because of this kind of thing on the ndoe and/or the other node:

"IncomingPacketHandler (channel=portal.hibernate)" daemon prio=1 tid=0x0000002b70482b30 nid=0x6d11 in Object.wait() [0x0000000042573000..0x0000000042575ab0]
 at java.lang.Object.wait(Native Method)
 - waiting on <0x0000002b2dc22680> (a org.jboss.cache.lock.ReadWriteLockWithUpgrade$WriterLock)
 at org.jboss.cache.lock.ReadWriteLockWithUpgrade$WriterLock.attempt(ReadWriteLockWithUpgrade.java:389)
 - locked <0x0000002b2dc22680> (a org.jboss.cache.lock.ReadWriteLockWithUpgrade$WriterLock)
 at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:199)
 at org.jboss.cache.Node.acquireWriteLock(Node.java:562)
 at org.jboss.cache.Node.acquire(Node.java:509)
 at org.jboss.cache.interceptors.OptimisticLockingInterceptor.lockNodes(OptimisticLockingInterceptor.java:160)
 at org.jboss.cache.interceptors.OptimisticLockingInterceptor.invoke(OptimisticLockingInterceptor.java:84)
 at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
 at org.jboss.cache.interceptors.OptimisticReplicationInterceptor.invoke(OptimisticReplicationInterceptor.java:85)
 at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
 at org.jboss.cache.interceptors.TxInterceptor.handleOptimisticPrepare(TxInterceptor.java:452)
 at org.jboss.cache.interceptors.TxInterceptor.handleRemotePrepare(TxInterceptor.java:336)
 at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:145)
 at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
 at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:183)
 at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5919)
 at org.jboss.cache.TreeCache._replicate(TreeCache.java:5196)

That thread is the one that carries messages up from the network layer to JBoss Cache. As you can see, it's blocking waiting for a lock. While those threads block, messages don't get transmitted, leading to the blocking you see on the other threads.

The real problem here is the 2 JBC instances are fighting over a lock -- and unnecessarily so - they are both trying to cache the same data that's been directly read from the db. Logicaly, that needn't trigger any cluster traffic at all. The old Hibernate 3.2.x / JBC 1.x integration code is quite poor. Totally redone for Hibernate 3.3. But that doesn't help you. :(

How critical is this? I'm tempted to extract a version of the TreeCacheProvider out of the EAP code, improve it a bit and package it as an independent library. Will help with your issue, and be a small benefit to the community. Won't be as good as the Hibernate 3.3 stuff, but better than what you've got. But I really don't have the cycle to do it, not until AS 5 is done.

14. Re: portal clustering problem when using optimistic locking

prabhat.jha Jun 23, 2008 12:11 PM (in response to prabhat.jha)

Thanks Brian. Let me first discuss this with portal folks. I have two choice now:

1. Make pessimistic locking as default configuration.
2.Male OL with REPL_ASYNC as default configuration. I have not run into this problem with REPL_ASYNC so far. OL performs much better under heavy load though.

Do you have any preference?
Actions

1 2 3 … Previous Next

Go to original post