6 Replies Latest reply on Apr 30, 2012 9:54 AM by galder.zamarreno

Block at containsKey() until timeout at JGroups FlowControl

tfromm Mar 16, 2012 2:10 PM

Hi,

I have 3 nodes running ISPN 5.1.2 under load. 99% of the operations are puts, removes and some containsKey calls on a DIST_SYNC cache with pessimistic transactions :-)

Sometimes one node stops working and blocks at containsKey():

"IS-8902-MThread-127" prio=10 tid=0x00007f18ec5af800 nid=0x1bd2 in Object.wait() [0x00007f18d4b48000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.jgroups.protocols.FlowControl$Credit.decrementIfEnoughCredits(FlowControl.java:553)
    - locked <0x00000006e069f920> (a org.jgroups.protocols.FlowControl$Credit)
    at org.jgroups.protocols.UFC.handleDownMessage(UFC.java:114)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
    at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
    at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1033)
    ...

Any ideas what is happening here and how to get around it?

1. Re: Block at containsKey() until timeout at JGroups FlowControl

tfromm Mar 19, 2012 6:29 AM (in response to tfromm)

If this happens often, the FlowControl settings need to be adjusted. See http://www.jgroups.org/papers/FlowControl.html
2. Re: Block at containsKey() until timeout at JGroups FlowControl

galder.zamarreno Mar 20, 2012 5:05 AM (in response to tfromm)

Hmmm, does the node eventually start working again? If not, did you get any thread dumps from the receiver nodes to see if they have some kind of deadlock that could stop them from sending credits back to the senders?

If instead what you're getting is momentary blocks which eventually recover, then it might be a matter of tweaking the FC settings. Try the following:

1. Increase FC.max_credits. This is the number of credits in bytes, so it must be below the heap size.
2. Increase FC.min_threshold. This is a percentage; raising it makes slow receivers send credits back earlier, so senders are less likely to block.
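
To make the blocking in the stack trace above easier to picture, here is a toy model of sender-side credits. It is only a sketch and not the JGroups implementation: the class name CreditSketch, its fields and the numbers used are made up for illustration, and the receiver-side min_threshold logic is not modelled at all. The sender spends credits per byte sent and waits, up to a timeout, for the receiver to hand credits back.

```java
import java.util.concurrent.TimeUnit;

/** Simplified illustration of sender-side credits; NOT the JGroups code. */
class CreditSketch {
    private long creditsLeft; // bytes the sender may still send

    CreditSketch(long maxCredits) {
        this.creditsLeft = maxCredits;
    }

    /**
     * Tries to take `bytes` credits, waiting up to `timeoutMs` for a replenishment.
     * Returns false if the credits did not arrive in time.
     */
    synchronized boolean decrementIfEnoughCredits(long bytes, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (creditsLeft < bytes) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false; // timed out without enough credits
            }
            try {
                wait(remainingMs); // a thread parked here shows up as TIMED_WAITING
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        creditsLeft -= bytes;
        return true;
    }

    /** Called when the receiver sends credits back to this sender. */
    synchronized void replenish(long bytes) {
        creditsLeft += bytes;
        notifyAll();
    }

    public static void main(String[] args) {
        CreditSketch credit = new CreditSketch(100);              // hypothetical max_credits of 100 bytes
        System.out.println(credit.decrementIfEnoughCredits(60, 50)); // true: 40 credits left
        System.out.println(credit.decrementIfEnoughCredits(60, 50)); // false: waits 50 ms, no credits arrive
        credit.replenish(60);                                        // receiver hands credits back
        System.out.println(credit.decrementIfEnoughCredits(60, 50)); // true
    }
}
```

In this model a larger starting credit means the sender runs dry less often, and earlier replenishment shortens the wait, which is the intuition behind the two settings above.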
3. Re: Block at containsKey() until timeout at JGroups FlowControl

tfromm Mar 20, 2012 5:16 AM (in response to galder.zamarreno)

That's the good thing: after the timeout the node resumes normal operation. I haven't noticed any data loss or anything like that. :-)

I'll try the tweaks if this situation starts to appear more frequently; otherwise I can't tell whether the configuration changes have a positive effect.
4. Re: Block at containsKey() until timeout at JGroups FlowControl

galder.zamarreno Mar 20, 2012 6:18 AM (in response to tfromm)

Ok. If it happens again, make sure you get thread dumps from all nodes in the cluster, because that way we can see what's up with not only the senders but the receivers as well.
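
Thread dumps are normally taken from outside the JVM (for example with jstack and the process id). If it is easier to trigger them from inside the application, something along these lines could work. It is a rough sketch using the standard java.lang.management API; the class name ThreadDumpSketch and the output layout are my own choices, and the output is not byte-for-byte what jstack prints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper that renders all live threads of this JVM as text. */
class ThreadDumpSketch {
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only ask for lock details if this JVM supports reporting them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Run it on every node at roughly the same time while the block is happening, so the sender and receiver dumps can be compared.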
5. Re: Block at containsKey() until timeout at JGroups FlowControl

tfromm Apr 24, 2012 8:28 AM (in response to galder.zamarreno)
Since 5.1.3 the issue appears more frequently.
I've attached thread dumps of all 3 nodes; "castor" is the one that blocks.

Meanwhile I'll change the credit values...

thread-dump-pollux.txt.zip 16.8 KB

thread-dump-helena.txt.zip 16.4 KB

thread-dump-castor.txt.zip 18.6 KB
6. Re: Block at containsKey() until timeout at JGroups FlowControl

galder.zamarreno Apr 30, 2012 9:54 AM (in response to tfromm)

That is very weird. Castor is waiting for responses, but there's no trace of any processing on the other nodes. There's no FC wait here though, just a wait for a reply.

This smells like a UDP problem in your environment, since I had a similar issue on my Mac due to small UDP buffers. I'd suggest you try running the same test with TCP.
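
One quick way to see whether the OS is limiting UDP buffers is to ask for a large receive buffer and read back what was actually granted. The sketch below uses only java.net.DatagramSocket; the class name UdpBufferCheck and the requested size of 20,000,000 bytes are arbitrary values picked for illustration, not figures taken from your configuration.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

/** Hypothetical check: how large a UDP receive buffer will the OS actually grant? */
class UdpBufferCheck {
    static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedBytes); // only a hint to the OS
            return socket.getReceiveBufferSize();        // what the OS really gave us
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 20_000_000; // arbitrary example size
        int granted = grantedReceiveBuffer(requested);
        System.out.println("requested=" + requested + " granted=" + granted);
    }
}
```

If the granted size comes back far below the size set in the JGroups UDP configuration, the OS limits are worth raising before drawing conclusions from the test.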