JGroups exception on concurrent shutdown of cluster
deepaksrivastavaz Nov 17, 2012 8:26 AMHi all,
I recently got involved in Infinispan library usage, while playing around it I noticed that concurrent shutdown request of multiple nodes at once cause JGroups (random)exception which eventually do not let the applications stop on all nodes waiting like for ever and its always and easily reproducible.
Environment details:
JDK: 1.6.0_32
Tomcat: apache-tomcat-6.0.28
JGroups protocol stack:
a.) TCP+TCPPING+MERGE2+FD_SOCK+FD+VERIFY_SUSPECT....
b.) UDP+PING+MERGE2+FD_SOCK+FD_ALL+VERIFY_SUSPECT....
ISPN version: Infinispan 'Delirium' 5.2.0.Alpha4
JGroups version:jgroups-3.2.0.Alpha1
Nodes: 3
Location: Single Machine(A local laptop hosting Windows 7)
Interestingly the NoClassDefFoundError exception being thrown is not fixed at a point but outputs different messages randomly during different invocations:
LOG excerpt:
Excerpt 1 from Run1:
(Pasting initial lines separately as well as final posting seems rendering the message differently)
[WARN ] [2012-11-17 18:35:00,467] [main] [protocols.TCP] - [failed to join /ff0e:0:0:0:0:0:75:75:7500 on ppp2: java.net.SocketException: error setting options]
[ERROR] [] [Incoming-4,SampleCluster,DKUMAR-DL-IN-27445] [protocols.UNICAST] - [couldn't deliver message [dst: DKUMAR-DL-IN-27445, src: DKUMAR-DL-IN-50134 (3 headers), size=764 bytes]]
java.lang.NullPointerException
at org.jgroups.logging.Log4JLogImpl.isTraceEnabled(Log4JLogImpl.java:48)
[WARN ] [2012-11-17 18:35:00,467] [main] [protocols.TCP] - [failed to join /ff0e:0:0:0:0:0:75:75:7500 on ppp2: java.net.SocketException: error setting options]
[ERROR] [] [Incoming-4,SampleCluster,DKUMAR-DL-IN-27445] [protocols.UNICAST] - [couldn't deliver message [dst: DKUMAR-DL-IN-27445, src: DKUMAR-DL-IN-50134 (3 headers), size=764 bytes]]
java.lang.NullPointerException
at org.jgroups.logging.Log4JLogImpl.isTraceEnabled(Log4JLogImpl.java:48)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:474)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:387)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:248)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:609)
at org.jgroups.JChannel.up(JChannel.java:715)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:178)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:899)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)
at org.jgroups.protocols.UNICAST.removeAndDeliver(UNICAST.java:651)
at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:627)
at org.jgroups.protocols.UNICAST.up(UNICAST.java:399)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:572)
at org.jgroups.protocols.BARRIER.up(BARRIER.java:102)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)
at org.jgroups.protocols.FD.up(FD.java:273)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
at org.jgroups.protocols.Discovery.up(Discovery.java:359)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1210)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1774)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1747)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Exceprt 2 from Run1:
[ERROR] [] [OOB-3,SampleCluster,DKUMAR-DL-IN-27445] [protocols.TCP] - [failed handling incoming message]
java.lang.NoClassDefFoundError: org/jgroups/protocols/VERIFY_SUSPECT$Entry
at org.jgroups.protocols.VERIFY_SUSPECT.addSuspect(VERIFY_SUSPECT.java:243)
at org.jgroups.protocols.VERIFY_SUSPECT.verifySuspect(VERIFY_SUSPECT.java:192)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:100)
at org.jgroups.protocols.FD.up(FD.java:273)
at org.jgroups.protocols.FD_SOCK.suspect(FD_SOCK.java:467)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:212)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
at org.jgroups.protocols.Discovery.up(Discovery.java:359)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1210)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1774)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1747)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassNotFoundException: org.jgroups.protocols.VERIFY_SUSPECT$Entry
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1645)
at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
... 14 more
Excerpt 3 from Run 1:
[ERROR] [] [Incoming-4,SampleCluster,DKUMAR-DL-IN-27445] [protocols.UNICAST] - [couldn't deliver message [dst: DKUMAR-DL-IN-27445, src: DKUMAR-DL-IN-50134 (3 headers), size=142 bytes]]
java.lang.NullPointerException
at org.jgroups.logging.Log4JLogImpl.isTraceEnabled(Log4JLogImpl.java:48)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:474)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:387)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:248)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:609)
at org.jgroups.JChannel.up(JChannel.java:715)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:178)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:899)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)
at org.jgroups.protocols.UNICAST.removeAndDeliver(UNICAST.java:651)
at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:627)
at org.jgroups.protocols.UNICAST.up(UNICAST.java:399)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:572)
at org.jgroups.protocols.BARRIER.up(BARRIER.java:102)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)
at org.jgroups.protocols.FD.up(FD.java:273)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
at org.jgroups.protocols.Discovery.up(Discovery.java:359)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1210)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1774)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1747)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Apart from Log4JLogImpl issue(I am using log4j-1.2.15.jar) it seems like there is a class unloading issue as well with server as above Except 2 outputs different classes randomly as reason of failure.
Do people do concurrent shutdown of their servers(tomcat or other) in real scenarios without any data-loss or other messages loss or is it preferred or advisable to do serial shutdown always...?
However concurrent startup doesn't create such problems.
Any help will be appreciated, let me know if you need more details around the environment?
--Deepak