1 Reply Latest reply on Mar 4, 2011 8:31 AM by wdfink

    Shunning problem

    dvm26

      Hi all,

      we're running a cluster of two JBoss 5.0.1 instances on two Linux (Red Hat 5) machines. The JDK is 1.6.0. We're observing (after some tests on a local cluster as well) that when one server is under heavy load it shuns itself from the cluster because it is missing heartbeats. However, it does not rejoin the cluster afterwards; instead we get the usual:

       

      13:19:25,979 ERROR [JChannel] failure reconnecting to channel, retrying

      org.jgroups.ChannelException: local_addr is null

          at org.jgroups.JChannel.startStack(JChannel.java:1566)

          at org.jgroups.JChannel.connect(JChannel.java:365)

          at org.jgroups.JChannel$CloserThread.run(JChannel.java:1962)

       

      errors. When doing these tests on my local cluster (between my workstation and another server) we observed that this behaviour only occurred when my workstation was under heavy load. If my network admin stopped the multicast traffic for a while and then restored it, JBoss would join the cluster again (merging partitions).

       

      Is this a bug? I have tried adding -Dbind.address=<my_ip> and I have tried -Djgroups.bind_addr=<my_ip>, always with the same behaviour.

       

      This is the problem that starts it all when the server is under load:

      17:14:06,148 WARN  [GMS] I (<my_ip>:40929) am not a member of view [<other_ip>:56989|4] [<other_ip>:56989], shunning myself and leaving the group (prev_members are [<other_ip>:56989, <my_ip>:40929], current view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929])

      17:14:06,149 WARN  [FD] I was suspected by <other_ip>:56989; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      17:14:06,149 WARN  [GMS] I (<my_ip>:40929) am not a member of view [<other_ip>:56989|4] [<other_ip>:56989], shunning myself and leaving the group (prev_members are [<other_ip>:56989, <my_ip>:40929], current view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929])

      17:14:09,374 WARN  [GMS] I (<my_ip>:40929) am not a member of view [<other_ip>:56989|4] [<other_ip>:56989], shunning myself and leaving the group (prev_members are [<other_ip>:56989, <my_ip>:40929], current view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929])

      17:14:08,383 WARN  [FD] I was suspected by <other_ip>:56989; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      17:14:08,383 WARN  [FD] I was suspected by <other_ip>:56989; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      17:14:29,496 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:25,802 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:25,802 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:30,084 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:30,391 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:31,555 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:30,699 WARN  [NAKACK] <my_ip>:40929] discarded message from non-member <other_ip>:56989, my view is [<other_ip>:56989|3] [<other_ip>:56989, <my_ip>:40929]

      17:14:31,038 ERROR [UDP] failed sending message to null (138 bytes)

      java.lang.Exception: dest=/239.255.100.101:45688 (141 bytes)

          at org.jgroups.protocols.UDP._send(UDP.java:353)
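
For a longer server.log, a small script can tally the JGroups warnings per protocol and report when the first self-shun happened. This is only a sketch: it assumes the line layout shown in the excerpt above (time, level, bracketed category, message), and the function name `summarize` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches lines such as: 17:14:06,148 WARN  [GMS] I (...) am not a member ...
# Assumption: the log uses the same layout as the excerpt in this thread.
LINE_RE = re.compile(r"^\s*(\d{2}:\d{2}:\d{2},\d{3})\s+(WARN|ERROR)\s+\[(\w+)\]\s+(.*)$")


def summarize(lines):
    """Return (counts per (level, category), timestamp of the first self-shun)."""
    counts = Counter()
    first_shun = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # stack-trace lines and blank lines are skipped
        stamp, level, category, message = match.groups()
        counts[(level, category)] += 1
        if first_shun is None and "shunning myself" in message:
            first_shun = stamp
    return counts, first_shun


if __name__ == "__main__":
    # Usage: python shun_summary.py server.log
    with open(sys.argv[1]) as handle:
        counts, first_shun = summarize(handle)
    print("first self-shun at:", first_shun)
    for (level, category), count in sorted(counts.items()):
        print(level, category, count)
```

The tally only shows which protocol (GMS, FD, NAKACK, UDP) complained and how often; it does not explain why the node fails to rejoin.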

       

       

      Any ideas on how we can make the shunned member rejoin the cluster? Currently we have to restart JBoss.

        • 1. Shunning problem
          wdfink

          I've had problems with clustering as well.

          If everything is configured correctly (including multicast on the network), the cluster should merge back together after a temporary separation.

           

          Under heavy load, GC activity or network trouble, the cluster might be temporarily separated.

          In our case the cause was misconfigured multicast: multicast was explicitly configured for a physical IP, while JBoss was bound to a logical (virtual) IP.

          The effect was that many multicast messages were not received and the cluster behaved erratically: nodes sometimes found each other, and sometimes built a cluster without a coordinator. Load balancing will not work ;(

           

          Have you checked with JGroups stand-alone? See the wiki: http://community.jboss.org/wiki/TestingJBoss
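
If the JGroups test classes are not at hand, a rough multicast check can also be done outside the JVM. The sketch below is not the JGroups test from the wiki; it is a plain UDP probe that assumes the group and port seen in the log above (239.255.100.101:45688) and takes the interface IP as an optional argument, so the physical and the logical IP can each be tried. The script name and function names are made up for illustration.

```python
import socket
import struct
import sys

# Assumption: group and port are taken from the UDP error in the posted log.
GROUP = "239.255.100.101"
PORT = 45688


def membership_request(group, iface="0.0.0.0"):
    """Pack the ip_mreq structure: group address followed by interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def receive(group, port, iface):
    """Join the group on the given interface and print every datagram received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface))
    while True:
        data, sender = sock.recvfrom(1024)
        print("received %r from %s" % (data, sender))


def send(group, port, iface, text):
    """Send one datagram to the group through the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", 2))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(iface))
    sock.sendto(text.encode("ascii"), (group, port))
    sock.close()


if __name__ == "__main__":
    # Usage: python mcast_probe.py recv [iface_ip]   or   python mcast_probe.py send [iface_ip]
    mode = sys.argv[1] if len(sys.argv) > 1 else "recv"
    iface = sys.argv[2] if len(sys.argv) > 2 else "0.0.0.0"
    if mode == "send":
        send(GROUP, PORT, iface, "probe from " + socket.gethostname())
    else:
        receive(GROUP, PORT, iface)
```

Start the receiver on one machine and the sender on the other, then swap roles. It is probably best to run this while JBoss is stopped, otherwise the receiver will also print the binary JGroups traffic on that port. If a probe only arrives when one particular interface IP is given, that would point to the kind of physical versus logical IP mismatch described above.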

           

          You should also upgrade to JBoss 5.1 because of its bug fixes!