80 Replies Latest reply on Jan 27, 2010 4:51 PM by marklittle
      • 30. Re: Jboss transaction recovery issue
        jhalliday

        12:41:46,843 DEBUG[arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_25] -Restored action status of ActionStatus.COMMITTED

        ...

        12:41:46,843 DEBUG[arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_27] -Restored heuristic decision of TwoPhaseOutcome.HEURISTIC_HAZARD

         

        It looks like that transaction is not in-doubt and therefore does not need to get recovered. It is finished with a heuristic outcome and is retained in the logs for information only. The most likely case of the db believing the tx is in-doubt whilst the tx mgr believes it is heuristically terminated, is mishandling of the XAException codes from commit by either the db driver or the tx mgr. Although you don't include the debug log from before the AS crash, I assume you are not halting the AS on a breakpoint and killing the JVM in say XAResourceRecord.topLevelCommit, but rather allowing it to continue processing for a time after getting a forced error from the db's XAResource.commit caused by the network outage. That is when it rewrites the log on disk from PREPARED (i.e. still needs to be driven to completion at recovery time) to COMMITTED (i.e. finished, take no further action) as a result of the XAException from the failed resource. If that is the case, stick on a debug breakpoint on XAResourceRecord.topLevelCommit and let me know what Exception (and what code if it's an XAException) is being thrown.

        • 31. Re: Jboss transaction recovery issue
          scarceller

          I do not simply kill the AS, I wait for all trans to timeout before doing the cold re-start.

           

          1 - start up load generator (50tps, 25 clients, 100% CPU, avgResponseTime=500ms/tran)

          2 - pull network on 2nd Database

          3 - quickly look for indoubt, I make sure I have one or more

          4 - leave the 2nd DB disconnected

          5 - wait for all trans to timeout (trantimeout=20seconds, dbtimeout=15seconds)

          6 - stop the AS - sometimes the AS hangs and won't stop - I have no choice but to kill it here, I always wait for trans to timeout/end before stopping.

          7 - reconnect network on 2nd DB

          8 - compare records (ORDERS) between 1st and 2nd DB, and find the 1st DB has one extra order and the 2nd DB has the matching one in an in-doubt state.

          9 - restart the AS

          10 - in-doubt never gets committed on 2nd DB

           

          NOTE: I do not reconnect the 2nd DB till the AS times out all trans and is gracefully shutdown. But be aware I also have test cases that simply power off the AppServer while under load, these are even harder for any AS to pass. Won't even talk about those right now, I need to concentrate on this simpler test case.

           

          EDIT:

          Please note: I do not cause these in-doubt conditions with break-points as doing things with break-points most likely would not stress thread safety issues that could be present in the AS and TM. These test cases are real-world tests that could occur in high transaction volume systems. I'm not interested in causing single tran failures with breakpoints.

          • 32. Re: Jboss transaction recovery issue
            scarceller

            Attached is a zip file from a test run that resulted in 3 in-doubts in the 2nd DB (Oracle). All 3 in-doubts must be committed, as I confirmed the Orders exist in the 1st DB (DB2).

             

            The zip has the ObjectStore directories and a .rtf file with the log output and some of my comments in appropriate places.

            • 33. Re: Jboss transaction recovery issue
              marklittle
              I was away from this for a few hours so haven't had a chance to catch up. I'll talk with Jonathan and Andrew tomorrow.
              • 34. Re: Jboss transaction recovery issue

                Thanks for talking.

                I have another problem. I originally intended to raise it after this problem was resolved, but I think it is better to raise it now.

                My env:

                -Stand alone JBOSSTS_4_6_0_GA(No Jboss AS)

                -JACOrb2.2.1

                -ORACLE 10.2.0.1.0

                 

                In my test I stop one Oracle server, then restart it.  The recovery service should reconnect to the ORACLE server automatically, but in my test I found it didn't reconnect successfully.

                In JBOSSTS_HOME/etc/jbossts-properties.xml, I added the XAResourceRecovery configuration:

                <property name="com.arjuna.ats.jta.recovery.XAResourceRecovery.oracle1" 
                     value="com.arjuna.ats.internal.jdbc.recovery.OracleXARecovery;oraclexarecovery1.xml"/>
                <property name="com.arjuna.ats.jta.recovery.XAResourceRecovery.oracle2" 
                     value="com.arjuna.ats.internal.jdbc.recovery.OracleXARecovery;oraclexarecovery2.xml"/>
                

                I saw the source code of class OracleXARecovery:

                private final void createConnection()
                        throws SQLException
                    {
                        try
                        {
                            if (_dataSource == null)
                                createDataSource();
                            if (_connection == null)
                            {
                            ......
                

                if I change one line, the reconnection could work well:

                private final void createConnection()
                        throws SQLException
                    {
                        try
                        {
                            if (_dataSource == null)
                                createDataSource();
                            if (_connection == null || _connection.getConnection().isClosed() )
                            {
                            ......

                 

                I don't know whether something is configured wrong or whether it is a bug.
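
A minimal sketch of the kind of staleness check the one-line change above is reaching for, written against plain java.sql.Connection rather than the real OracleXARecovery fields. The class name StaleConnectionCheck, the method needsReconnect and the timeout parameter are hypothetical and not part of JBossTS. One caveat worth keeping in mind: isClosed() is only guaranteed to report true after close() has been called, so a connection dropped by a database restart can still look open, which is why the sketch also asks the driver to probe the connection with isValid().

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper, not JBossTS code: decides whether a cached
// recovery connection should be thrown away and re-created.
class StaleConnectionCheck {

    static boolean needsReconnect(Connection connection, int timeoutSeconds) {
        if (connection == null) {
            return true; // nothing cached yet
        }
        try {
            // isClosed() reflects an explicit close(); isValid() asks the
            // driver to check that the connection is still usable.
            return connection.isClosed() || !connection.isValid(timeoutSeconds);
        } catch (SQLException e) {
            // If the check itself fails, treat the connection as unusable.
            return true;
        }
    }
}
```

A caller would run this before each recovery pass and rebuild the connection when it returns true.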

                • 35. Re: Jboss transaction recovery issue
                  adinn

                  scarceller wrote:

                   

                  I do not simply kill the AS, I wait for all trans to timeout before doing the cold re-start.

                   

                  1 - start up load generator (50tps, 25 clients, 100% CPU, avgResponseTime=500ms/tran)

                  2 - pull network on 2nd Database

                  3 - quickly look for indoubt, I make sure I have one or more

                  4 - leave the 2nd DB disconnected

                  5 - wait for all trans to timeout (trantimeout=20seconds, dbtimeout=15seconds)

                  6 - stop the AS - sometimes the AS hangs and won't stop - I have no choice but to kill it here, I always wait for trans to timeout/end before stopping.

                  7 - reconnect network on 2nd DB

                  8 - compare records (ORDERS) between 1st and 2nd DB, and find the 1st DB has one extra order and the 2nd DB has the matching one in an in-doubt state.

                  9 - restart the AS

                  10 - in-doubt never gets committed on 2nd DB

                   

                  NOTE: I do not reconnect the 2nd DB till the AS times out all trans and is gracefully shutdown. But be aware I also have test cases that simply power off the AppServer while under load, these are even harder for any AS to pass. Won't even talk about those right now, I need to concentrate on this simpler test case.

                   

                  EDIT:

                  Please note: I do not cause these in-doubt conditions with break-points as doing things with break-points most likely would not stress thread safety issues that could be present in the AS and TM. These test cases are real-world tests that could occur in high transaction volume systems. I'm not interested in causing single tran failures with breakpoints.

                   

                  Ok, then the outcome will be as Jonathan described in the note before this one. The transaction coordinator will have been notified of the failure to commit the 2nd DB (the XAResource.commit call throws an XAException to notify this). The coordinator responds by rewriting the object store log entry with status COMMITTED but also with details which indicate that the outcome was a HEURISTIC_COMMIT. If you look in your server log you should see a message which details a heuristic outcome for this transaction.

                   

                  This explains why the log record is ignored by the recovery system. The log record you see left behind in the object store merely serves to record details of the heuristic outcome -- you can use the log viewer to inspect it, confirm that its status is COMMITTED and identify the heuristic participant(s). n.b. this is not a record of an in-flight transaction; the transaction has been resolved but with an error. Heuristic outcomes are legitimate ways of dealing with commit failures and they normally require manual intervention to resolve and/or correct the resulting database states -- the log record is an aid to this process.

                   

                  If instead you had killed the JVM just before the call to XAResource.commit then, assuming the DB was still up and running when JBoss restarted, you would see the PREPARED transaction record in the log get loaded by the recovery module and you would see the transaction be rolled forward.

                   

                  Note that the progress of the XAResource.commit call depends on which database and driver you are using. Not all databases/drivers handle communication failures in the same way. So, in other cases the transaction coordinator may wait longer for the driver to time out or may even sit forever until the DB comes back. In other cases the thrown exception may differ e.g. XA_RETRY vs XA_ERR. I am afraid that's part of our real world.
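
The behaviour described in this reply can be pictured with a toy model. This is only a sketch under stated assumptions: the types RecoveryLogModel, LogRecord, Status and Heuristic are hypothetical and are not JBossTS classes, and the model keeps successfully committed records in the list purely for illustration. It shows the one point that matters here: a record left at PREPARED is picked up by recovery, while a record rewritten to COMMITTED with a heuristic attached is kept for information only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; these types are hypothetical and are not JBossTS classes.
class RecoveryLogModel {
    enum Status { PREPARED, COMMITTED }
    enum Heuristic { NONE, HAZARD }

    static final class LogRecord {
        final String txId;
        Status status = Status.PREPARED;      // written at the end of phase one
        Heuristic heuristic = Heuristic.NONE;

        LogRecord(String txId) { this.txId = txId; }
    }

    /** Applies the result of the second-phase commit call to the log record. */
    static void onCommitResult(LogRecord record, boolean committed, boolean retryRequested) {
        if (committed) {
            record.status = Status.COMMITTED;
        } else if (!retryRequested) {
            // Failure that was not flagged as retryable: the coordinator
            // finishes the transaction and records a heuristic outcome.
            record.status = Status.COMMITTED;
            record.heuristic = Heuristic.HAZARD;
        }
        // If a retry was requested the record stays PREPARED.
    }

    /** Recovery only drives records that are still PREPARED. */
    static List<String> needsRecovery(List<LogRecord> records) {
        List<String> ids = new ArrayList<>();
        for (LogRecord record : records) {
            if (record.status == Status.PREPARED) {
                ids.add(record.txId);
            }
        }
        return ids;
    }
}
```

In this model a transaction whose commit failed with a non-retryable error never appears in needsRecovery, which matches the symptom reported earlier in the thread.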

                  • 36. Re: Jboss transaction recovery issue
                    marklittle
                    Best to take this (the issue about reconnecting to Oracle automatically) to a separate forum posting. Things are confusing enough as they stand ;-)
                    • 37. Re: Jboss transaction recovery issue
                      marklittle
                      Yes, Andrew is right: if this is your scenario then the only way that recovery will happen automatically is if the call to commit fails and the underlying XAResource implementation returns an XA_RETRY. Otherwise we simply don't have enough information to know for certain what happened and fail safe, assuming a heuristic outcome of one kind or another, leaving it up to the admin to figure out. The type of heuristic may be indicated by the return code in the XAException: for example, XAException.XA_RBROLLBACK means we're in a heuristic-rollback situation.
                      • 38. Re: Jboss transaction recovery issue
                        scarceller

                        mark.little@jboss.com wrote:

                         

                        Yes, Andrew is right: if this is your scenario then the only way that recovery will happen automatically is if the call to commit fails and the underlying XAResource implementation returns an XA_RETRY. Otherwise we simply don't have enough information to know for certain what happened and fail safe, assuming a heuristic outcome of one kind or another, leaving it up to the admin to figure out. The type of heuristic may be indicated by the return code in the XAException: for example, XAException.XA_RBROLLBACK means we're in a heuristic-rollback situation.

                        Mark,

                         

                        Before you read below, please understand I just want to be sure that the behavior of the JBoss TM I'm seeing is 'as designed' and not a bug or something not configured correctly. Keep this in mind as you read on.

                         

                        I have tested this same exact test case on several other J2EE App Servers and they always pass this test case. Meaning, if an in-doubt is left behind (for any reason) on the 2nd DB and the work has already been committed to the 1st DB, they pass this test case once the AS can re-establish communications to the 2nd DB. Meaning the in-doubt is always resolved by the TM.

                         

                        The condition for this test is as follows:

                        - 1st DB is committed (cannot be undone)

                        - TM fails the commit on the 2nd DB. This means the TM COULD have sent the commit but never received the response. The TM is not sure what happened here, but it knows it did not get the 'GOOD' commit response from the 2nd DB. - read on, more details follow  -

                         

                        In this case the 2 DataBases could be in one of 2 states:

                        1) 1st DB committed but 2nd is not (most likely this case)

                        2) 1st DB committed and 2nd is also committed. (not as likely but can occur: in this case it could simply be that the network outage occurred while the 2nd DB was committing)

                         

                        But in both those cases the TM certainly knows it never completed the commit:

                        1) it either may not have been able to send the commit request

                        - or -

                        2) it may have sent the request and not received the commit response.

                        In either of the above 2 cases the TM should treat them as a failed commit.

                         

                        In the end the only 100% certain way for the TM to know the 2nd DB committed is if it gets a GOOD Commit Reply from the DB. If while trying to commit it gets a timeout or any other ERROR reply (not GOOD reply) then the TM simply needs to retry the commit at a later time as outlined in the next paragraph.

                         

                        In either case the TM should not remove the tran from the Object Store, and later when the network is restored the TM simply checks to see if that tran is still in-doubt in the 2nd DB in need of commit. If it is in the DB then the TM replays the commit, and if it is not in the DB the TM simply removes it from the ObjectStore as it has already been committed.

                         

                        Bottom line: if the TM can't handle this test case (leaves the in-doubt in the 2nd DB) this is not good since in our case it leaves product INVENTORY records locked and those locked products can no longer be ordered till this in-doubt is resolved. And if the TM does not (or can't) resolve this it then must be fixed by a human. This human has a bit of work to do because they MUST figure out the state of the other participants (in my case the 1st DB) then decide the correct action for the in-doubt in the 2nd DB (commit or rollback). After determining the correct action they need to then perform the commit or rollback on the in-doubt. Then they must go to the ObjectStore in the TM and clean this tran out as well. Keep in mind my test case only uses 2 XA participants and if it were to use say 5 participants this could get way more complicated as you could have 3 participants committed and 2 in-doubt.

                         

                        With no disrespect I'm simply letting you know this is the first J2EE TM I've tested that can't handle this scenario.
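
The retry scheme proposed in this post (check whether the tran is still in-doubt in the 2nd DB, replay the commit if it is, otherwise drop the log entry) can be sketched against the standard javax.transaction.xa interfaces. This is an illustration of the poster's proposal only, not of how JBossTS behaves. The class InDoubtReconciler, the Action enum and both method names are hypothetical. The sketch also bakes in the poster's assumption that an Xid missing from the recover() scan has already been committed, which is exactly the point the later replies in this thread dispute.

```java
import java.util.Arrays;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hypothetical sketch of the proposal in the post above; not JBossTS code.
class InDoubtReconciler {
    enum Action { COMMIT_REPLAYED, LOG_ENTRY_REMOVABLE }

    static Action reconcile(XAResource resource, Xid loggedXid) throws XAException {
        // Ask the resource manager for the branches it still holds in-doubt.
        Xid[] inDoubt = resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
        if (inDoubt != null) {
            for (Xid xid : inDoubt) {
                if (sameXid(xid, loggedXid)) {
                    resource.commit(xid, false); // second phase of 2PC, not one-phase
                    return Action.COMMIT_REPLAYED;
                }
            }
        }
        // ASSUMPTION from the post: not in-doubt means already committed.
        return Action.LOG_ENTRY_REMOVABLE;
    }

    // Xid implementations are not required to define equals(), so compare the parts.
    static boolean sameXid(Xid a, Xid b) {
        return a.getFormatId() == b.getFormatId()
                && Arrays.equals(a.getGlobalTransactionId(), b.getGlobalTransactionId())
                && Arrays.equals(a.getBranchQualifier(), b.getBranchQualifier());
    }
}
```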

                        • 39. Re: Jboss transaction recovery issue
                          marklittle
                          Hi Sal. This is just to let you know that I'll reply to your latest entry a bit later today.
                          • 40. Re: Jboss transaction recovery issue
                            jhalliday

                            "In the end the only 100% certain way for the TM to know the 2nd DB commited is if it gets a GOOD Commit Reply from the DB. If while trying to commit it gets a timeout or any other ERROR reply (not GOOD reply) then the TM simply needs to retry the commit at a later time"

                             

                            That's utter nonsense. Read the XA spec. Now read it again, especially the description of the possible return values from xa_commit(). Done? Good. Now you know that the error codes fall into distinct groups: those that communicate a definitive outcome (e.g. 'I rolled this back. It's finished.') and those that don't ('I may or may not have committed this.')  The former group is terminal and retrying is pointless - you'll just get an XAER_NOTA or other error from the RM. The latter group MAY be recoverable. As I've described previously, it's possible that either a) the db driver is returning a terminal code instead of a retry or b) the transaction manager is getting a retryable code but treating it as terminal. To figure out which I need the exception information previously requested.
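
To make the grouping of error codes concrete, here is a small sketch that sorts the errorCode of a javax.transaction.xa.XAException into broad groups. The constants are the standard ones on XAException. The class CommitFailureClassifier, the Kind enum and the grouping itself are this sketch's own simplification and are not taken from JBossTS; what a transaction manager does with each group is its own policy.

```java
import javax.transaction.xa.XAException;

// Hypothetical helper; the grouping below is a simplification for illustration.
class CommitFailureClassifier {
    enum Kind { RETRY_LATER, ROLLED_BACK, HEURISTIC, UNKNOWN_XID, OTHER_ERROR }

    static Kind classify(XAException e) {
        int code = e.errorCode;
        if (code == XAException.XA_RETRY) {
            return Kind.RETRY_LATER;           // outcome still open, may be retried
        }
        if (code >= XAException.XA_RBBASE && code <= XAException.XA_RBEND) {
            return Kind.ROLLED_BACK;           // the XA_RB* family
        }
        switch (code) {
            case XAException.XA_HEURHAZ:
            case XAException.XA_HEURCOM:
            case XAException.XA_HEURRB:
            case XAException.XA_HEURMIX:
                return Kind.HEURISTIC;         // the RM reports a heuristic outcome
            case XAException.XAER_NOTA:
                return Kind.UNKNOWN_XID;       // the RM does not know this Xid
            default:
                return Kind.OTHER_ERROR;       // e.g. XAER_RMERR, XAER_RMFAIL, XAER_PROTO
        }
    }
}
```

Logging classify(e) together with e.errorCode at the point where commit fails would supply the exception information requested above.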

                            • 41. Re: Jboss transaction recovery issue
                              scarceller

                              jhalliday wrote:

                               

                              "In the end the only 100% certain way for the TM to know the 2nd DB commited is if it gets a GOOD Commit Reply from the DB. If while trying to commit it gets a timeout or any other ERROR reply (not GOOD reply) then the TM simply needs to retry the commit at a later time"

                               

                              That's utter nonsense. Read the XA spec. Now read it again, especially the description of the possible return values from xa_commit(). Done? Good. Now you know that the error codes fall into distinct groups: those that communicate a definitive outcome (e.g. 'I rolled this back. It's finished.') and those that don't ('I may or may not have committed this.')  The former group is terminal and retrying is pointless - you'll just get an XAER_NOTA or other error from the RM. The latter group MAY be recoverable. As I've described previously, it's possible that either a) the db driver is returning a terminal code instead of a retry or b) the transaction manager is getting a retryable code but treating it as terminal. To figure out which I need the exception information previously requested.

                              Doesn't XAER_NOTA simply mean the xid is not valid in the given resource? If so, this implies the xid is not in-doubt, am I wrong here? What I have is an in-doubt XID in the Oracle DB. It is seen by running 'select * from dba_2pc_pending' so it's not the case that the DB took some sort of action on this xid. The DB has not in any way committed or rolled back this xid. It has left it for the TM to instruct it what to do.

                               

                              I understand that under certain cases the xid may have been resolved by the DB on its own. But this is not the case in my scenario. My understanding of a Heuristic RM decision is when the RM decides on its own to do something without being told. But my Oracle DB will not commit or rollback once told to prepare as doing so could cause the data to be out of sync with the 1st DB. I'm 100% certain the test case causes Oracle to have in-doubts and these are not in some unknown state. They are simply waiting forever to be told what to do and if the TM were to ask this in-doubt xid to commit or rollback the DB would not error out. We could be simply talking about 2 different things here. Keep in mind the log I sent you was from a heavily loaded run and within that run we could have some xids that did rollback because they never entered prepare. In this case some TMs later try the xid and get a Heuristic outcome because the Oracle DB already rolled them back.

                               

                              In summary, aren't Heuristic outcomes only possible when the RM decides to take action on its own, so that the xid is no longer valid? My understanding is you don't get Heuristic outcomes on a xid that's still valid and in an in-doubt state.

                              • 42. Re: Jboss transaction recovery issue
                                jhalliday

                                > The DB has not in any way commit or rolledback this xid. It has left it for the TM to instruct it what to do.

                                 

                                The TM did instruct it to commit. The DB failed to do so. What I'm trying to pin down is the exact nature of that failure. It is possible the DB failed in such a way as to give the TM the impression it had terminated the transaction branch. It is equally possible it failed in such a way as to indicate the matter was still unresolved. The difference is critical.

                                • 43. Re: Jboss transaction recovery issue
                                  scarceller

                                  jhalliday wrote:

                                   

                                  > The DB has not in any way commit or rolledback this xid. It has left it for the TM to instruct it what to do.

                                   

                                  The TM did instruct it to commit. The DB failed to do so. What I'm trying to pin down is the exact nature of that failure. It is possible the DB failed in such a way as to give the TM the impression it had terminated the transaction branch. It is equally possible it failed in such a way as to indicate the matter was still unresolved. The difference is critical.

                                  OK, got it.

                                   

                                  But this will take some work on my end because my test case is under very heavy load and I'll need to look through tons of log entries. Plus I don't have full debug turned on during the test because I need max throughput. I only turn on full TRACE after I see in-doubts are not being resolved in the 2nd DB.

                                   

                                  After giving this issue thought I have a very dumb question I hope you can answer:

                                   

                                  The TM has 3 recovery modules:

                                  1 - AtomicActionRecoveryModule

                                  2 - TORecoveryModule

                                  3 - XARecoveryModule

                                   

                                  Is the XARecoveryModule the only one that is linked to the AppServerJDBCXARecovery code? Meaning, is this recovery module the only one that can build recovery connections to the DBs via the AppServerJDBCXARecovery?

                                   

                                  Are the other 2 modules even capable of getting a connection to the DBs if the AS is simply re-started?

                                  More importantly, can the AtomicActionRecoveryModule get a connection to the DB for something in the ObjectStore that no longer has its original transaction in flight (like after a cold re-start)?

                                   

                                  Thanks.

                                  • 44. Re: Jboss transaction recovery issue
                                    jhalliday

                                    > this will take some work on my end because my test case is under very heavy load and I'll need to look through tons of log entries.

                                     

                                    Well you only need one transaction to reproduce the issue. Breakpoint the code before the commit, pull the plug on the db and step the debugger.  Or use your existing test and grep the logs on "com.arjuna.ats.internal.jta.resources.arjunacore.commit" which is what's used by the exception logging in the commit wrapper. It's not even a trace level output - it's visible at WARN.
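
As an alternative to grep, the same filtering can be done with a few lines of Java. This is only a sketch: the logger name is the one quoted above, while the class CommitWarningFilter and its method are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; only the logger name comes from the post above.
class CommitWarningFilter {
    static final String COMMIT_LOGGER =
            "com.arjuna.ats.internal.jta.resources.arjunacore.commit";

    /** Returns only the log lines that mention the commit-wrapper logger. */
    static List<String> commitFailures(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(COMMIT_LOGGER))
                .collect(Collectors.toList());
    }
}
```

The input list could come from java.nio.file.Files.readAllLines on whatever server log file the installation writes.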

                                     

                                    > Is the XARecoveryModule the only one that is linked to the AppServerJDBCXARecovery code? meaning is this recovery module the only one that can build recovery connections to the DBs via the AppServerJDBCXARecovery?

                                     

                                    yes, the app server code is basically a plugin for the XARecoveryModule. The other rec modules don't instantiate XAResources directly - they rely on XARecoveryModule to do it for them. Hence they don't need plugins.