I've got 3 nodes running ISPN 5.1.2 under load. 99% of the operations are puts, removes and some containsKey calls on a DIST_SYNC cache with pessimistic transactions :-)
Sometimes one node stops working and blocks at containsKey():
"IS-8902-MThread-127" prio=10 tid=0x00007f18ec5af800 nid=0x1bd2 in Object.wait() [0x00007f18d4b48000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- locked <0x00000006e069f920> (a org.jgroups.protocols.FlowControl$Credit)
Any ideas what happens here and how to get around this?
Hmmm, does the node eventually start working again? If not, did you get any thread dumps from the receiver nodes to see whether they have some kind of deadlock that could stop them from sending credits back to the senders?
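If taking dumps by hand is awkward, a deadlock check can also be done from inside each receiver JVM. This is a minimal sketch using the standard `java.lang.management` API; the class name `DeadlockCheck` is made up for illustration, and it only sees deadlocks within the JVM it runs in, so it would have to run on every node.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Infinispan or JGroups.
public class DeadlockCheck {

    /** Returns a description of deadlocked threads, or an empty string if none are found. */
    public static String describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Covers both object monitors and java.util.concurrent locks; null means no deadlock.
        long[] ids = threads.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder report = new StringBuilder();
        // Ask for locked monitors and synchronizers so the report shows who holds what.
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            if (info != null) {
                report.append(info);
            }
        }
        return report.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "No deadlocked threads found" : report);
    }
}
```

A plain `jstack` dump of the receiver process reports Java-level deadlocks as well, so this is only worth wiring in if you want the check to run automatically while the test is under load.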
If instead what you're getting is momentary blocks which eventually recover, then it might be a matter of tweaking the FC settings. Try the following:
1. Increase FC.max_credits. This is the number of bytes a sender may send to a receiver before it needs new credits, so it must stay below the heap size.
2. Increase FC.min_threshold. This is a percentage of max_credits; raising it makes slow receivers send credits back earlier, so senders are less likely to block.
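To see why both knobs help, here is a toy model of credit-based flow control. It is a deliberately simplified sketch, not JGroups' actual FC code: one sender, one receiver, fixed-size messages delivered instantly, and a credit reply that takes a fixed number of ticks to arrive. The class `CreditSim` and all parameter names are assumptions made up for the illustration.

```java
// Toy simulation only; it does not use or reproduce the JGroups implementation.
public class CreditSim {

    /**
     * Counts the ticks in which the sender wants to send but has too few credits.
     *
     * @param maxCredits   bytes the sender may send before needing new credits
     * @param minThreshold fraction of maxCredits at which the receiver replenishes
     * @param msgSize      size of each message in bytes
     * @param replyDelay   ticks a credit reply needs to reach the sender
     * @param ticks        number of ticks to simulate
     */
    public static int blockedTicks(long maxCredits, double minThreshold,
                                   long msgSize, int replyDelay, int ticks) {
        long senderCredits = maxCredits;   // what the sender may still send
        long receiverView = maxCredits;    // receiver's count of the sender's remaining credits
        long minCredits = Math.round(maxCredits * minThreshold);
        int replyArrivesAt = -1;           // -1 means no credit reply is in flight
        long pendingCredits = 0;
        int blocked = 0;

        for (int t = 0; t < ticks; t++) {
            // Deliver a credit reply that has reached the sender.
            if (replyArrivesAt >= 0 && t >= replyArrivesAt) {
                senderCredits += pendingCredits;
                replyArrivesAt = -1;
            }
            if (senderCredits >= msgSize) {
                senderCredits -= msgSize;
                receiverView -= msgSize;
                // Receiver replenishes once the remaining credits drop to the threshold.
                if (receiverView <= minCredits && replyArrivesAt < 0) {
                    pendingCredits = maxCredits - receiverView;
                    receiverView = maxCredits;
                    replyArrivesAt = t + replyDelay;
                }
            } else {
                blocked++; // sender has to wait for credits
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("low threshold:  " + blockedTicks(1000, 0.1, 100, 5, 100));
        System.out.println("high threshold: " + blockedTicks(1000, 0.6, 100, 5, 100));
        System.out.println("more credits:   " + blockedTicks(10000, 0.1, 100, 5, 100));
    }
}
```

In this model the sender stalls with the low threshold because the reply is only requested when the credits are nearly gone, and the stalls disappear once either the threshold or the credit pool is raised. Real clusters have several senders and variable latency, so treat the numbers as illustration only.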
That is very weird. Castor is waiting for responses, but there's no trace on the other nodes of any processing. There's no FC wait here though, just waiting for a reply.
This smells like a UDP problem in your env since I had a similar issue on my Mac due to small UDP buffers. I'd suggest you try running the same test with TCP.
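Before switching transports, it may be worth checking whether the OS is capping the UDP receive buffer. This is a rough sketch using only the standard `java.net.DatagramSocket` API; the class name `UdpBufferCheck` and the 25 MB request are made-up values for illustration, so substitute whatever size your JGroups UDP configuration asks for.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

// Hypothetical diagnostic, not part of Infinispan or JGroups.
public class UdpBufferCheck {

    /** Requests a receive buffer of the given size and returns what the OS actually granted. */
    public static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket socket = new DatagramSocket()) {
            // The requested size is only a hint; the OS may silently cap it.
            socket.setReceiveBufferSize(requestedBytes);
            return socket.getReceiveBufferSize();
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 25 * 1024 * 1024; // assumed value, adjust to your configuration
        int granted = grantedReceiveBuffer(requested);
        System.out.println("requested=" + requested + " granted=" + granted);
        if (granted < requested) {
            System.out.println("The OS limit is lower than the requested buffer size");
        }
    }
}
```

If the granted size is far below the requested one, raising the OS-level UDP buffer limits is an alternative to moving to TCP.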