The problem is described at JBPORTAL-1879 and I have narrowed down it to configuration where this problem can be easily replicated.
It only happens when optimistic locking used with REPL_SYNC as cache_mode.
<attribute name="NodeLockingScheme">OPTIMISTIC</attribute> <attribute name="IsolationLevel">REPEATABLE_READ</attribute> <attribute name="CacheMode">REPL_SYNC</attribute>
Changing CacheMode to REPL_ASYNC or LOCAL as well as using Pessimistic locking does not result into this problem. I think we need to look into following possibilities besides looking into portal code base
1. Problem with org.hibernate.cache.OptimisticTreeCache or the way we are using it
2. Deadlock in org.jboss.cache.lock.ReadWriteLockWithUpgrade
I'll have a look at this in the next couple of days but it's worth noting that with SYNC communication, specially replication, there's always a chance of a deadlock because locks are only acquired at commit time. So, individual nodes could successfully acquire nodes locally, and then cross each other when trying to acquire locks cluster wide. The end result is both nodes timing out in their operation:
Tx1 acquires WL on N1 for node /a/b
Tx2 acquires WL on N2 for node /a/b
Tx1 tries to commit and attempts to acquire lock for /a/b while Tx2 is trying to commit and attempt to lock that same node cluster wide.
Tx1 rollbacks because it can't acquire locks for /a/b in N2 and viceversa
As I said, I'll check the attached files in the next couple of days to confirm whether this is precisely what you're seeing. However, it's worth keeping it in mind.
It's true that pessimistic locking for 2nd level cache is currently our preferred configuration, hence it's the default for EJB3 2nd level cache in EAP/AS 4.2+ but I dunno the exact details on why PL was chosen over OL. I'm gonna forward the forum link to Brian who I think might be able to give us further information on why PL was chosen over OL.
In the mean time, I don't have enough info to determine the root cause of your issues and so I'd like you to follow this instructions, repeat the test and attach a zip file with the resulting files to the JIRA.
- Add thread logging information and enable TRACE for org.jboss.cache package. See 
- Generate thread dumps in both nodes when you're unable to access that page.
Once you've done this, could you try with the latest EAP 4.2.x CP which is EAP 4.2.0.GA_CP03? This is our latest release for customers which has newer bug fix releases for JBoss Cache and Hibernate amongst other things.
 Add to conf/jboss-log4j.xml: <category name="org.jboss.cache"> <priority value="TRACE" /> </category> Change ConversionPattern for server.log file appender to full pattern: <param name="ConversionPattern" value="%d %-5r %-5p [%c] (%t:%x) %m%n"/>
PL was chosen over OL as the default as it has been around a lot longer, has a lot more usage in the wild, and is regarded as more stable/battle tested. Over the course of 2007 we identified and resolved a number of issues with OL, the latest of which is http://jira.jboss.org/jira/browse/JBCACHE-1228 which I suspect may be the source of the problem you were seeing on JBPORTAL-1879.
With jboss cache version 1.4.1.SP8 (http://repository.jboss.com/jboss/cache/1.4.1.SP8.patch02-brew), I do not have problem accessing a portal page. Hence this certainly fixes this problem. However, I see some errors with 2 and 3 nodes:
2008-06-18 19:16:00,439 ERROR [org.jboss.cache.TreeCache] Caught TimeoutException acquiring locks on region /
org.jboss.cache.lock.TimeoutException: failure acquiring lock: fqn=/_BUDDY_BACKUP_/10.18.1.131_34704/org/jboss/portal/core/impl/model/portal/PortalImpl/windowStates/org.jboss.portal.core.impl.model.portal.PortalImpl.windowStates#3112, caller=Thread[IncomingPacketHandler (channel=portal.hibernate),5,JGroups threads], lock=write owner=GlobalTransaction:<10.18.1.131:34704>:1759 (activeReaders=0, activeWriter=Thread[IncomingPacketHandler (channel=portal.hibernate),5,], waitingReaders=0, waitingWriters=0, waitingUpgrader=0)
2008-06-18 14:49:39,603 ERROR [org.jgroups.protocols.UDP] failed unmarshalling message
Do I need to update jgroups library for this cache version? With 4-node cluster, I am getting very bad performance. I will look into that.
What JGroups version are you using? I'd recommend 2.4.2, which is what is in the latest EAP release (which is also what I'd recommend).
Indeed. The latest EAP CP releases, whether 4.2.0.GA_CP03 or 4.3.0.GA_CP01 contain up to date JBC and JGRP releases, see:
This is getting interesting. I tried with JBPAPP_4_3_0_GA_CP01 in production configuration, I am seeing the same behavior as I was seeing before. If I access a node before other comes up, everything goes fine.
I have removed the buddy replication attribute from tree cache configuration as you mentioned. The complete tree cache configuration is attached to http://jira.jboss.com/jira/browse/JBPORTAL-1879.
The error I see on server side is:
No ManagedConnections available within configured blocking timeout ( 30000 [ms] ); - nested throwable: (javax.resource.ResourceException: No ManagedConnections available within configured blocking timeout ( 30000 [ms] ))
I will take thread dump on both nodes and share my findings.
Thread dumpts attached to JBPORTAL-1879. With 5 users hitting one server, I primarily see
- waiting on <0x0000002b2dad5118> (a java.util.HashMap)
- locked <0x0000002b2dad5118> (a java.util.HashMap)
The way servers are started is:
java -server -Xms3072m -Xmx3072m -XX:MaxPermSize=128m -XX:+UseLargePages -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djgroups.udp.ip_ttl=1 -jar /tmp/pjha/eap43/bin/run.jar -c production -b 10.18.1.129 -Djboss.partition.name=PortalScalabilityTestPartition -Dcluster.route.name=perf11 -u 126.96.36.199
You're seeing those because of this kind of thing on the ndoe and/or the other node:
"IncomingPacketHandler (channel=portal.hibernate)" daemon prio=1 tid=0x0000002b70482b30 nid=0x6d11 in Object.wait() [0x0000000042573000..0x0000000042575ab0] at java.lang.Object.wait(Native Method) - waiting on <0x0000002b2dc22680> (a org.jboss.cache.lock.ReadWriteLockWithUpgrade$WriterLock) at org.jboss.cache.lock.ReadWriteLockWithUpgrade$WriterLock.attempt(ReadWriteLockWithUpgrade.java:389) - locked <0x0000002b2dc22680> (a org.jboss.cache.lock.ReadWriteLockWithUpgrade$WriterLock) at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:199) at org.jboss.cache.Node.acquireWriteLock(Node.java:562) at org.jboss.cache.Node.acquire(Node.java:509) at org.jboss.cache.interceptors.OptimisticLockingInterceptor.lockNodes(OptimisticLockingInterceptor.java:160) at org.jboss.cache.interceptors.OptimisticLockingInterceptor.invoke(OptimisticLockingInterceptor.java:84) at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68) at org.jboss.cache.interceptors.OptimisticReplicationInterceptor.invoke(OptimisticReplicationInterceptor.java:85) at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68) at org.jboss.cache.interceptors.TxInterceptor.handleOptimisticPrepare(TxInterceptor.java:452) at org.jboss.cache.interceptors.TxInterceptor.handleRemotePrepare(TxInterceptor.java:336) at org.jboss.cache.interceptors.TxInterceptor.invoke(TxInterceptor.java:145) at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68) at org.jboss.cache.interceptors.CacheMgmtInterceptor.invoke(CacheMgmtInterceptor.java:183) at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:5919) at org.jboss.cache.TreeCache._replicate(TreeCache.java:5196)
That thread is the one that carries messages up from the network layer to JBoss Cache. As you can see, it's blocking waiting for a lock. While those threads block, messages don't get transmitted, leading to the blocking you see on the other threads.
The real problem here is the 2 JBC instances are fighting over a lock -- and unnecessarily so - they are both trying to cache the same data that's been directly read from the db. Logicaly, that needn't trigger any cluster traffic at all. The old Hibernate 3.2.x / JBC 1.x integration code is quite poor. Totally redone for Hibernate 3.3. But that doesn't help you. :(
How critical is this? I'm tempted to extract a version of the TreeCacheProvider out of the EAP code, improve it a bit and package it as an independent library. Will help with your issue, and be a small benefit to the community. Won't be as good as the Hibernate 3.3 stuff, but better than what you've got. But I really don't have the cycle to do it, not until AS 5 is done.
Thanks Brian. Let me first discuss this with portal folks. I have two choice now:
1. Make pessimistic locking as default configuration.
2.Male OL with REPL_ASYNC as default configuration. I have not run into this problem with REPL_ASYNC so far. OL performs much better under heavy load though.
Do you have any preference?