Shunning problem
dvm26 Mar 4, 2011 6:59 AM
Hi all,
we're running a cluster of two JBoss 5.0.1 instances on two Linux (Red Hat 5) machines. The JDK is 1.6.0. We're observing (and have reproduced in tests on a local cluster) that when one server is under heavy load, it shuns itself from the cluster because it is missing heartbeats. However, it does not rejoin the cluster afterwards; instead we're getting the usual:
13:19:25,979 ERROR [JChannel] failure reconnecting to channel, retrying
org.jgroups.ChannelException: local_addr is null
at org.jgroups.JChannel.startStack(JChannel.java:1566)
at org.jgroups.JChannel.connect(JChannel.java:365)
at org.jgroups.JChannel$CloserThread.run(JChannel.java:1962)
errors. When running these tests on my local cluster (between my workstation and another server), we observed that this behaviour only occurred when my workstation was under heavy load. If my network admin stopped the multicast traffic for a while and then restored it, JBoss would join the cluster again (merging partitions).
Is this a bug? I have tried adding -Dbind.address=<my_ip>, and I have also tried -Djgroups.bind_addr=<my_ip>, always with the same behaviour.
This is the log output at the point where the problem starts, when the server is under load:
17:14:06,148 WARN [GMS] I (<my_ip>:40929) am not a member of view [<other_ip>:56989|4] [<other_ip>:56989], shunning myself and leaving the group (prev_members are [<other_ip>:56989, <my_ip>:40929], current view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929])
17:14:06,149 WARN [FD] I was suspected by <other_ip>:56989; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
17:14:06,149 WARN [GMS] I (<my_ip>:40929) am not a member of view [<other_ip>:56989|4] [<other_ip>:56989], shunning myself and leaving the group (prev_members are [<other_ip>:56989, <my_ip>:40929], current view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929])
17:14:09,374 WARN [GMS] I (<my_ip>:40929) am not a member of view [<other_ip>:56989|4] [<other_ip>:56989], shunning myself and leaving the group (prev_members are [<other_ip>:56989, <my_ip>:40929], current view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929])
17:14:08,383 WARN [FD] I was suspected by <other_ip>:56989; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
17:14:08,383 WARN [FD] I was suspected by <other_ip>:56989; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
17:14:29,496 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:25,802 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:25,802 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:30,084 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:30,391 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:31,555 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:30,699 WARN [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]
17:14:31,038 ERROR [UDP] failed sending message to null (138 bytes)
java.lang.Exception: dest=/239.255.100.101:45688 (141 bytes)
at org.jgroups.protocols.UDP._send(UDP.java:353)
Any ideas on how we can make the shunned member rejoin the cluster? Currently we have to restart JBoss.