Unexpected failover behaviour in distributed transaction
chrismeadows May 10, 2010 11:16 AMJBoss AS 5.1.0 GA
JDK 1.6.0 update 10 64 bit
Ubuntu 8.0.4 server 64 bit
Hi there,
I have a two node JBoss cluster, each node on a separate server. I have a non-clustered SLSB JMS client inside the container sending a JMS message to a queue that is consumed by a clustered MDB. The MDB inserts a row into a database. The JMS delivery and database insertion are governed by CMP and CMT with an XA datasource. This all works fine, I see the MDBs loadbalancing the queue consumption across the nodes in the cluster.
However, if I kill -9 one of the nodes, the failover doesn't behave as I would expect. That is, I would expect transactions that hadn't completed the atomic action (of delivering a message and inserting a row into the database) would failover to the remaining node and complete successfully. This is not what I see.
If I send (say) 500 messages, then initiate failover, the JBoss logs detect the node failure and indicate that failover is complete. But not all the messages failover, and I don't end up with the expected number of rows inserted in the database. The JBM_MSG and JBM_MSG_REF tables still have data in them.
If I saw no failover at all, I'd suspect my own understanding of what I'm doing. But the fact that some messages/databases inserts failover and some don't lead me to believe that I've either got some screwy JB config, or hit a JB defect. The XA concept appears fine (that is, I don't see situations where a message is delivered and a db insert doesn't occur or vice versa)
The JMSClient uses the jmsXA ConnectionFactory. Each JBoss node has a different ServerPeerID (0 and 99)
I can upload a full sample maven project if need be, but here are the relevant code portions.
The JMS Client is on node 1 and is not clustered
{code}
@Stateless
@LocalBinding(jndiBinding = "Sender/local")
@RemoteBinding(jndiBinding = "Sender/remote")
public class Sender implements SenderLocal, SenderRemote {
ConnectionFactory connectionFactory;
Queue queue;
private Log log = LogFactory.getLog(getClass());
public Sender() {
}
public void requestProduct(int iMessage) throws Exception {
Context c = getInitialContext();
connectionFactory = (ConnectionFactory)c.lookup(JNDIConstants.JMS_CONNECTION_FACTORY);
queue = (Queue) c.lookup(JNDIConstants.MDBCLUSTERTEST_QUEUE_JNDI);
requestProductImpl(iMessage);
}
private Context getInitialContext() {
InitialContext ic = null;
try {
Properties p = new Properties();
p.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
p.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
p.put(Context.PROVIDER_URL, "10.3.94.132" + ":1100" + "," + "10.3.94.122" + ":1100");
ic = new InitialContext(p);
} catch (NamingException e) {
e.printStackTrace();
}
return ic;
}
public void requestProductImpl(int iMessage) throws Exception {
Connection connection = null;
Session session = null;
MessageProducer sender = null;
try {
connection = connectionFactory.createConnection();
session = connection.createSession(true, Session.AUTO_ACKNOWLEDGE);
ObjectMessage m = session.createObjectMessage();
FooMessage message = new FooMessage();
message.sMessage = "Message: " + iMessage;
m.setObject(message);
sender = session.createProducer(queue);
sender.send(m);
log.info(message.sMessage + " sent to queue");
}
catch (RuntimeException e) {
log.error("RuntimeException caught at sender: " + e.getMessage(), e );
throw e;
}
catch (Exception e) {
log.error("Exception caught at sender: " + e.getMessage(), e );
throw e;
}
finally {
if( sender != null ) {
sender.close();
}
if( session != null ) {
session.close();
}
if( connection != null ) {
connection.close();
}
}
}
}
{code}
and the MDB receiver is on both nodes and is clustered (via the queue and postoffice) config
{code}
@MessageDriven(activationConfig = {
@ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
@ActivationConfigProperty(propertyName = "destination", propertyValue = JNDIConstants.MDBCLUSTERTEST_QUEUE_JNDI) })
public class Receiver implements MessageListener {
static int iCount = 0;
private static Log log = LogFactory.getLog(Receiver.class);
@Resource
private MessageDrivenContext mdc;
@EJB(mappedName = JNDIConstants.ACCOUNTING_LOG_SERVICE_LOCAL_JNDI)
AccountingLogService acclog;
@TransactionAttribute(TransactionAttributeType.REQUIRED)
public void onMessage(Message m) {
String sMessage = null;
try {
ObjectMessage tm = (ObjectMessage) m;
if (!(tm.getObject() instanceof FooMessage)) {
throw new IllegalArgumentException("message is not a FooMessage: " + tm.getObject().getClass());
}
sMessage = ((FooMessage)tm.getObject()).sMessage;
log.info("Received: " + sMessage + " iCount = " + ++iCount );
acclog.log(new AccountingLogEntryBean(sMessage));
log.info("Persisted: " + sMessage + " iCount = " + iCount );
} catch (Exception e) {
mdc.setRollbackOnly();
log.info("Rolling Back receiver: " + sMessage + " : " + e.getMessage(), e);
throw new RuntimeException(e);
}
}
}
{code}
The queue is nothing unusual
{code}
<mbean code="org.jboss.jms.server.destination.QueueService
name="jboss.messaging.destination:service=Queue,name=MDBClusterTestQueue"
xmbean-dd="xmdesc/Queue-xmbean.xml">
<depends optional-attribute-name="ServerPeer">jboss.messaging:service=ServerPeer</depends>
<depends>jboss.messaging:service=PostOffice</depends>
<attribute name="DLQ">jboss.messaging.destination:service=Queue,name=MDBClusterTestQueueDLQ</attribute>
<attribute name="ExpiryQueue">jboss.messaging.destination:service=Queue,name=MDBClusterTestQueueExpiryQueue</attribute>
<attribute name="RedeliveryDelay">5000</attribute>
<attribute name="Clustered">true</attribute>
</mbean>{code}
The AccountingLog is just a simple DAO facade
{code}
@Stateless
@LocalBinding(jndiBinding = JNDIConstants.ACCOUNTING_LOG_SERVICE_LOCAL_JNDI)
@RemoteBinding(jndiBinding = JNDIConstants.ACCOUNTING_LOG_SERVICE_REMOTE_JNDI)
public class AccountingLogServiceBean implements AccountingLogServiceRemote, AccountingLogServiceLocal {
@PersistenceContext(unitName = "AccountingLogService")
private EntityManager em;
static Log log = LogFactory.getLog(AccountingLogServiceBean.class);
public AccountingLogServiceBean() {
}
@SuppressWarnings("unchecked")
@TransactionAttribute(TransactionAttributeType.REQUIRED)
public void log(AccountingLogEntryBean entry) {
em.persist(entry);
}
}{code}
The persistence xml file uses JTA and an XA datasource
{code}
<persistence xmlns="http://java.sun.com/xml/ns/persistence"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"
version="1.0">
<persistence-unit name="AccountingLogService" transaction-type="JTA">
<jta-data-source>java:/OracleXADS</jta-data-source>
<class>com.foo.mdbcluster.common.entity.AccountingLogEntryBean</class>
<properties>
<property name="hibernate.dialect"
value="org.hibernate.dialect.Oracle10gDialect" />
<property name="hibernate.hbm2ddl.auto" value="update" />
</properties>
</persistence-unit>
</persistence>{code}
The XA datasource itself is
{code}
<datasources>
<xa-datasource>
<jndi-name>OracleXADS</jndi-name>
<track-connection-by-tx></track-connection-by-tx>
<isSameRM-override-value>false</isSameRM-override-value>
<xa-datasource-property name="URL">jdbc:oracle:oci:@(DESCRIPTION=(LOAD_BALANCE=YES)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=10.3.94.135)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=10.3.94.136)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=FOO_TAF)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15))))</xa-datasource-property>
<xa-datasource-class>oracle.jdbc.xa.client.OracleXADataSource</xa-datasource-class>
<xa-datasource-property name="User">xxxx</xa-datasource-property>
<xa-datasource-property name="Password">xxxx</xa-datasource-property>
<no-tx-separate-pools></no-tx-separate-pools>
<prepared-statement-cache-size>100</prepared-statement-cache-size>
<share-prepared-statements>true</share-prepared-statements>
<track-statements>warn</track-statements>
<min-pool-size>140</min-pool-size>
<max-pool-size>140</max-pool-size>
<blocking-timeout-millis>180000</blocking-timeout-millis>
<transaction-isolation>TRANSACTION_READ_COMMITTED</transaction-isolation>
<exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
<!-- corresponding type-mapping in the standardjbosscmp-jdbc.xml --><metadata>
<type-mapping>Oracle9i</type-mapping>
</metadata>
</xa-datasource>
</datasources>{code}
The JMS oracle-persistence-service is plumbed to the same XA datasource, and is using a clustered post office.
For testing convenience, I have a Service MBean for invoking the JMS Client
{code}
@Service( objectName = JNDIConstants.TESTER_SERVICE)
@Management(MDBClusterTestManagement.class)
public class MDBClusterTest implements MDBClusterTestManagement {
private Log log = LogFactory.getLog(getClass());
@EJB( mappedName = "Sender/local")
SenderLocal sender;
Context c = null;
ConnectionFactory connectionFactory;public void startTest(int iNumMessages) {
c = getInitialContext();
try {
connectionFactory = (ConnectionFactory)c.lookup(JNDIConstants.JMS_CONNECTION_FACTORY);
for( int i = 0; i < iNumMessages; i++) {
try {
sender.requestProduct(i);
} catch (Exception e) {
log.error("Exception at tester level:" + e.getMessage(), e);
}
}} catch (NamingException e) {
e.printStackTrace();
}
}
private Context getInitialContext() {
InitialContext ic = null;
try {
Properties p = new Properties();
p.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
p.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
p.put(Context.PROVIDER_URL, "10.3.94.132" + ":1100" + "," + "10.3.94.122" + ":1100");
ic = new InitialContext(p);
} catch (NamingException e) {
e.printStackTrace();
}
return ic;
}
}{code}
So I have node 1 with an MBean that invokes a JMS client 500 times on node 1. This sends 500 messages load balanced across node 1 and node 2. During the message consumption, I kill node 2 with kill -9
A sample log file of node 1 (that remains up) is
{code}
2010-05-10 13:36:06,837 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 0 sent to queue
2010-05-10 13:36:06,848 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 1 sent to queue
2010-05-10 13:36:06,851 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 2 sent to queue
2010-05-10 13:36:06,854 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 3 sent to queue
...
2010-05-10 13:36:07,629 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 497 sent to queue
2010-05-10 13:36:07,631 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 498 sent to queue
2010-05-10 13:36:07,632 INFO [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 499 sent to queue
2010-05-10 13:36:09,113 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-9) Received: Message: 6 iCount = 1
2010-05-10 13:36:09,137 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-15) Received: Message: 12 iCount = 3
2010-05-10 13:36:09,141 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-14) Received: Message: 11 iCount = 11
2010-05-10 13:36:09,139 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-10) Received: Message: 7 iCount = 6
2010-05-10 13:36:09,141 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-4) Received: Message: 1 iCount = 12
2010-05-10 13:36:09,137 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-13) Received: Message: 10 iCount = 5
2010-05-10 13:36:09,139 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-17) Received: Message: 14 iCount = 9
2010-05-10 13:36:09,137 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-6) Received: Message: 3 iCount = 4
...
2010-05-10 13:36:12,764 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-81) Received: Message: 78 iCount = 78
2010-05-10 13:36:12,766 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-81) Persisted: Message: 78 iCount = 78
2010-05-10 13:36:12,769 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-82) Received: Message: 79 iCount = 79
2010-05-10 13:36:12,771 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-82) Persisted: Message: 79 iCount = 79
2010-05-10 13:36:12,799 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] (VERIFY_SUSPECT.TimerThread,DefaultPartition,10.3.94.132:34771) Suspected member: 10.3.94.122:47594
2010-05-10 13:36:12,814 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-83) Received: Message: 80 iCount = 80
2010-05-10 13:36:12,816 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-83) Persisted: Message: 80 iCount = 80
2010-05-10 13:36:12,818 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-84) Received: Message: 81 iCount = 81
2010-05-10 13:36:12,818 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-85) Received: Message: 82 iCount = 82
2010-05-10 13:36:12,821 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-84) Persisted: Message: 81 iCount = 82
2010-05-10 13:36:12,823 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-85) Persisted: Message: 82 iCount = 82
2010-05-10 13:36:12,829 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-86) Received: Message: 83 iCount = 83
2010-05-10 13:36:12,831 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-86) Persisted: Message: 83 iCount = 83
2010-05-10 13:36:12,847 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] (Incoming-8,10.3.94.132:34771) New cluster view for partition DefaultPartition (id: 2, delta: -1) : [10.3.94.132:1099]
2010-05-10 13:36:12,847 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) I am (10.3.94.132:1099) received membershipChanged event:
2010-05-10 13:36:12,847 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) Dead members: 1 ([10.3.94.122:1099])
2010-05-10 13:36:12,847 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) New Members : 0 ([])
2010-05-10 13:36:12,847 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) All Members : 1 ([10.3.94.132:1099])
2010-05-10 13:36:12,883 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@32c98179 got new view [10.3.94.132:34771|2] [10.3.94.132:34771], old view is [10.3.94.132:34771|1] [10.3.94.132:34771, 10.3.94.122:47594]
2010-05-10 13:36:12,884 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) I am (10.3.94.132:34771)
2010-05-10 13:36:12,903 INFO [org.jboss.cache.RPCManagerImpl] (Incoming-19,10.3.94.132:34771) Received new cluster view: [10.3.94.132:34771|2] [10.3.94.132:34771]
2010-05-10 13:36:12,904 ERROR [org.jgroups.blocks.ConnectionTable] (ViewHandler,MessagingPostOffice-DATA,10.3.94.132:7900) failed sending data to 10.3.94.122:7900: java.net.SocketException: Socket closed
2010-05-10 13:36:12,946 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-87) Received: Message: 84 iCount = 84
2010-05-10 13:36:12,976 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-89) Received: Message: 86 iCount = 85
2010-05-10 13:36:12,976 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-88) Received: Message: 85 iCount = 85
2010-05-10 13:36:12,985 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-90) Received: Message: 87 iCount = 86
2010-05-10 13:36:12,985 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-88) Persisted: Message: 85 iCount = 86
2010-05-10 13:36:12,985 INFO [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] (Incoming-15,10.3.94.132:34771) JBoss Messaging is failing over for failed node 99. If there are many messages to reload this may take some time...
2010-05-10 13:36:12,987 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-90) Persisted: Message: 87 iCount = 86
2010-05-10 13:36:12,987 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-89) Persisted: Message: 86 iCount = 86
2010-05-10 13:36:12,987 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-87) Persisted: Message: 84 iCount = 86
2010-05-10 13:36:12,989 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-91) Received: Message: 88 iCount = 87
...
2010-05-10 13:36:13,781 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-6) Persisted: Message: 123 iCount = 121
2010-05-10 13:36:13,781 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-28) Persisted: Message: 122 iCount = 121
2010-05-10 13:36:13,923 INFO [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] (Incoming-15,10.3.94.132:34771) JBoss Messaging failover completed
2010-05-10 13:36:13,961 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) Dead members: 1 ([10.3.94.122:47594])
2010-05-10 13:36:13,961 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) All Members : 1 ([10.3.94.132:34771])
2010-05-10 13:36:14,014 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-23) Received: Message: 124 iCount = 122
2010-05-10 13:36:14,014 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-33) Received: Message: 125 iCount = 123
2010-05-10 13:36:14,015 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-33) Persisted: Message: 125 iCount = 123
2010-05-10 13:36:14,015 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-23) Persisted: Message: 124 iCount = 123
...
2010-05-10 13:36:25,240 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-54) Received: Message: 499 iCount = 450
2010-05-10 13:36:25,240 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-48) Received: Message: 498 iCount = 451
2010-05-10 13:36:25,240 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-11) Persisted: Message: 497 iCount = 451
2010-05-10 13:36:25,241 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-54) Persisted: Message: 499 iCount = 451
2010-05-10 13:36:25,241 INFO [com.foo.mdbcluster.Receiver] (WorkManager(2)-48) Persisted: Message: 498 iCount = 451
2010-05-10 13:36:36,594 WARN [org.jboss.jms.server.connectionmanager.SimpleConnectionManager] (Timer-3) A problem has been detected with the connection to remote client a32m3e-vq2wtx-g91c4dzt-1-g91c4vwl-ab, jmsClientID=null. It is possible the client has exited without closing its connection(s) or the network has failed. All associated connection resources will be cleaned up.
{code}
That's the end of the log. After node 1 has reached this quiescent state, I will typically see 4-10 messages not being failed over.
Is there something of interest here worth investigating?
Thanks for any help/guidance
Chris