1 Reply Latest reply: Apr 26, 2012 4:53 AM by Wolf-Dieter Fink

compacting garbage collector timeouts and cluster in jboss 5.1

Armin Haaf Newbie

We have a JBoss 5.1 cluster with 3 nodes, each with a 2 GB heap. The VMs run with "-XX:+UseParNewGC -XX:+UseConcMarkSweepGC", which works most of the time without problems.

 

However, sometimes a VM does a compacting garbage collection, which means a stop-the-world pause of 50-80 seconds. During this time the node gets suspected by the other nodes.

After the node gets responsive again it logs:

 

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,385 WARN  [org.jgroups.protocols.FD] [T:125798] I was suspected by 10.199.18.13:39310; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,386 WARN  [org.jgroups.protocols.FD] [T:125798] I was suspected by 10.199.18.13:39310; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,386 WARN  [org.jgroups.protocols.FD] [T:125798] I was suspected by 10.199.18.13:39310; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,387 DEBUG [org.jgroups.protocols.pbcast.FLUSH] [T:127] Received START_FLUSH at 10.199.18.11:45393 but I am not flush participant, not responding

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,387 DEBUG [org.jgroups.protocols.pbcast.FLUSH] [T:127] Received START_FLUSH at 10.199.18.11:45393 but I am not flush participant, not responding

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,387 DEBUG [org.jgroups.protocols.pbcast.FLUSH] [T:127] Received START_FLUSH at 10.199.18.11:45393 but I am not flush participant, not responding

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,388 DEBUG [org.jgroups.protocols.pbcast.GMS] [T:127] view=[10.199.18.12:39800|9] [10.199.18.12:39800, 10.199.18.13:39310]

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,388 DEBUG [org.jgroups.protocols.pbcast.GMS] [T:127] [local_addr=10.199.18.11:45393] view is [10.199.18.12:39800|9] [10.199.18.12:39800, 10.199.18.13:39310]

server.log.2012-04-24_08-28-37:2012-04-24 08:10:11,388 WARN  [org.jgroups.protocols.pbcast.GMS] [T:127] I (10.199.18.11:45393) am not a member of view [10.199.18.12:39800|9] [10.199.18.12:39800, 10.199.18.13:39310], shunning myself and leaving the group (prev_members are [10.199.18.12:34166, 10.199.18.13:60923, 10.199.18.11:45393, 10.199.18.12:39800, 10.199.18.13:39310], current view is [10.199.18.11:45393|8] [10.199.18.11:45393, 10.199.18.12:39800, 10.199.18.13:39310])

 

After this the cluster is broken and at least the node that did the compacting GC must be restarted; sometimes the whole cluster is broken and must be restarted.

 

Is there a configuration to avoid such problems?

  • 1. Re: compacting garbage collector timeouts and cluster in jboss 5.1
    Wolf-Dieter Fink Master

    I'm not sure about the cluster split and why the node does not rejoin the cluster after the GC. I've seen cases where the network sometimes drops some of the multicast packets, which might not be a problem during normal operation but can matter in the case of such a split.

    You should keep an eye on it.

     

    The important thing is to eliminate the long full-GC pauses, and this will be hard work as well.

    One option is to use the incremental mode  (-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10) see [1] for more information.
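    As a sketch, those flags might be combined with the existing collector options in the server's JAVA_OPTS, e.g. in bin/run.conf (the heap sizes below are only illustrative; keep your existing -Xms/-Xmx values):

    ```shell
    # Example JAVA_OPTS for JBoss 5.1 (e.g. in bin/run.conf).
    # The incremental-mode flags spread CMS work over time to shorten pauses.
    JAVA_OPTS="-Xms2g -Xmx2g \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing \
      -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10"
    ```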

    Also, you should analyze the memory footprint. It often happens that objects survive minor GCs too often and are promoted to the old generation, only to die there immediately; in this case it might help to increase the young and survivor areas to avoid the promotion.
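    For example, the young generation and survivor spaces could be enlarged with flags like these (the sizes are assumptions for illustration only; check the tenuring distribution output before settling on values):

    ```shell
    # Illustrative sizing only: a 512 MB young generation, survivor ratio 6
    # (eden is 6x each survivor space), and a higher tenuring threshold so
    # short-lived objects die in the young generation instead of being
    # promoted to the old generation.
    JAVA_OPTS="$JAVA_OPTS -Xmn512m -XX:SurvivorRatio=6 \
      -XX:MaxTenuringThreshold=8 -XX:+PrintTenuringDistribution"
    ```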

    To analyze this you can use jstat, VisualVM or JConsole; jstat in particular can be used in production without a big impact on the running VM.
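    For instance, GC behaviour can be sampled with jstat like this (the PID 12345 is a placeholder for the JBoss VM's process id):

    ```shell
    # Print GC utilization every 1000 ms for the given VM (PID is hypothetical).
    # E = eden occupancy %, O = old-gen occupancy %, YGC/FGC = young/full GC
    # counts, FGCT = accumulated full-GC time -- watch for FGCT jumping by
    # tens of seconds, matching the 50-80 s pauses described above.
    jstat -gcutil 12345 1000
    ```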

     

    [1] http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

    [2] http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html