80 Replies Latest reply on Jan 27, 2010 4:51 PM by marklittle
      • 60. Re: Jboss transaction recovery issue
        scarceller

        Mark,

         

        Thanks for the details.

         

        I think shaohua qu has the simplest test case, as he is setting breakpoints in the code and testing a single transaction. I think he and I may be seeing the same situation. However, his test case is much simpler, and he should be able to get the XAException.XAER_?? very easily. I'm going to re-run my test and look for which XAException.XAER_?? I may be getting shortly after the network pull. My runs are under heavy load, but I know what to search for in the log. I hope to do this early next week.

         

        But in the meantime I'd really like to see what shaohua qu is getting as well. I asked him this in the post just above this one.

        • 61. Re: Jboss transaction recovery issue
          marklittle
          No problem Sal. Just provide the information whenever you can.
          • 62. Re: Jboss transaction recovery issue
            scarceller

            mmusgrov wrote:

             

            scarceller wrote:

             

            Any tips on starting the Java Swing GUI to inspect the tran logs?

            I think this tool could really be useful.

             

            Thanks.

             

            There should be a file called INSTALL in the distribution that describes how to install the product. The last section says how to run the embedded tools (basically the tools are packaged as a sar file - just drop the sar into the app server deploy directory and then navigate to the EmbeddedTools service in the JMX console). The JBossTS core manual shows how to use it, but it is quite straightforward - it shows the Object Store hierarchy as a tree view from which you can drill down to see details of individual transactions.

             

            Note that this tool is provided as is. We are working on providing similar functionality as a JMX MBean so the Swing version of the tool will eventually be removed.

            I had issues installing the jbossts-tools.sar - I simply dropped it into the deploy dir but then when trying to launch it I was getting file not found exceptions. It was looking for files in a ..\default\tmp\jboss_tools_tmp but this dir did not exist. So I simply created the dir and manually unpacked the jbossts-tools.sar files to that dir.

             

            Now the tool starts but I'm not sure if it is fully working?

             

            It shows me the ObjectStore and I see my xid in both:

            Recovery->TransactionStatusMgr->-3f57994f:e14:4b59d62f:0

            StateManager->BasicAction->TwoPhaseCoordinator->AtomicAction->-3f57994f:c87:4b59cb0e:451e

             

            But I can't drill down any further than this. I'd like to see the details of those entries. Also what do these parts ffffffff:fff:ffffffff:ffff of those IDs mean? Can I tell what state things are in? Like in need of Commit, Rollback or Manual help?

             

            Thanks.
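On the question of what the colon-separated parts mean, here is a minimal sketch that splits such an identifier into its four hex fields. The field meanings used below (host address, process id, creation time in seconds since the Unix epoch, per-process counter) are an assumption about the Uid layout, not something confirmed in this thread, and `UidFields` is a hypothetical helper, not a JBossTS class. Note that the identifier only names the record; it should not be expected to say whether the transaction needs a commit, a rollback or manual help.

```java
import java.time.Instant;

/** Hypothetical helper: splits an id such as "-3f57994f:c87:4b59cb0e:451e" into its hex fields. */
final class UidFields {
    final long host;     // assumed: host address packed into a signed 32-bit value
    final long process;  // assumed: id of the process that created the Uid
    final long seconds;  // assumed: creation time in seconds since the Unix epoch
    final long counter;  // assumed: counter that keeps ids unique within the process

    private UidFields(long host, long process, long seconds, long counter) {
        this.host = host;
        this.process = process;
        this.seconds = seconds;
        this.counter = counter;
    }

    /** Parses the four colon-separated hexadecimal fields; a leading '-' is accepted. */
    static UidFields parse(String uid) {
        String[] parts = uid.split(":");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 colon-separated hex fields: " + uid);
        }
        return new UidFields(
                Long.parseLong(parts[0], 16),
                Long.parseLong(parts[1], 16),
                Long.parseLong(parts[2], 16),
                Long.parseLong(parts[3], 16));
    }

    /** Interprets the third field as a timestamp (only valid under the assumption above). */
    Instant createdAt() {
        return Instant.ofEpochSecond(seconds);
    }

    public static void main(String[] args) {
        UidFields f = parse("-3f57994f:c87:4b59cb0e:451e");
        System.out.println("process=" + f.process
                + " created=" + f.createdAt()
                + " counter=" + f.counter);
    }
}
```

If the assumption holds, the third field of the ids above decodes to a time in late January 2010, which at least matches the dates of the test runs discussed here.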

            • 63. Re: Jboss transaction recovery issue
              mmusgrov

              Selecting HashedActionStore from the combo box it should look something like:

               

              StateManager
                BasicAction
                  TwoPhaseCoordinator
                      AtomicAction
                          -3f57994f:c87:4b59cb0e:451e
                              Tx Info
                              Prepared List
                              Pending List
                              Heuristic List
                              Failed List
                              Read-only List

               

              Then the tool would be able to show what state the various participants of the transaction are in. For example, clicking on "Heuristic List" would show the heuristically completed participants in the top right pane. Then clicking on one of those would let you do things like commit/rollback/forget.

               

              So you're not getting any of that?

              What do you see in the top right pane if you click on the transaction identifier?

              Did you mention which version of JBossTS you are using?

              • 64. Re: Jboss transaction recovery issue
                scarceller

                First, as I mentioned, dropping the .sar into deploy did not work.

                 

                Version is JBossEAPV5.0

                 

                I do not see any of the stuff in bold

                StateManager
                  BasicAction
                    TwoPhaseCoordinator
                        AtomicAction
                            -3f57994f:c87:4b59cb0e:451e
                                Tx Info
                                Prepared List
                                Pending List
                                Heuristic List
                                Failed List
                                Read-only List

                 

                But it may well be related to the deploy issues I had for the jbossts-tools.sar.

                 

                Do you have a new version of the tools.sar?

                Or any suggestions for what may be wrong with the deploy of the .sar I currently have that came with JBossEAPV5.0?

                 

                Thanks.

                • 65. Re: Jboss transaction recovery issue
                  marklittle
                  Just a thought, but if you are using a licenced version of EAP 5.0 then you may want to use the CSP for raising this issue further. You should get better turn-around time.
                  • 66. Re: Jboss transaction recovery issue
                    mmusgrov

                    I don't know why the sar fails to deploy - the unpack depends on the existence of the JBoss AS tmp directory (I'll try it out on a fresh EAP release), but the workaround you applied was good.

                     

                    And it's odd that the tool isn't displaying the various intentions lists (these lists determine what the TM intends to do with the various participants of the transaction). Would it be possible for you to attach a copy of the log store, which should be located in <server>data/tx-object-store?
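Before attaching the log store, it can help to list what is actually in it. The following is a small sketch that walks a directory and prints every file relative to the store root, so the type hierarchy (StateManager, BasicAction and so on) shows up as path segments. `ObjectStoreLister` is a hypothetical helper, and the directory is whatever path you pass in; nothing here reads or interprets the record contents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: lists the files found under an object store directory. */
final class ObjectStoreLister {

    /** Returns every regular file under root as a '/'-separated path relative to root, sorted. */
    static List<String> list(Path root) throws IOException {
        List<String> entries = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(Files::isRegularFile)
                .map(p -> root.relativize(p).toString().replace('\\', '/'))
                .sorted()
                .forEach(entries::add);
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ObjectStoreLister <path-to-object-store>");
            return;
        }
        for (String entry : list(Paths.get(args[0]))) {
            System.out.println(entry);
        }
    }
}
```

Run it against a copy of the store rather than the live directory if the server is still running.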

                    • 67. Re: Jboss transaction recovery issue

                      My test case was attached to my first, original message. Can you not see it?

                      In addition, although only Sal and I have raised this question here, that doesn't prove other people haven't hit the same problem; they may simply not have posted about it on this forum.

                      I am very grateful for your and everyone's efforts on this question.

                      • 68. Re: Jboss transaction recovery issue

                        The exception is oracle.jdbc.xa.OracleXAException:

                        oracle.jdbc.xa.OracleXAException
                         at oracle.jdbc.xa.OracleXAResource.checkError(OracleXAResource.java:938)
                         at oracle.jdbc.xa.client.OracleXAResource.commit(OracleXAResource.java:504)
                         at com.xxxxx.swat.txn.sample.resource.OracleResource.commit(OracleResource.java:116)
                         at org.omg.CosTransactions.ResourcePOA._invoke(Unknown Source)
                         at org.jacorb.poa.RequestProcessor.invokeOperation(Unknown Source)
                         at org.jacorb.poa.RequestProcessor.process(Unknown Source)
                         at org.jacorb.poa.RequestProcessor.run(Unknown Source)


                        I know the transaction rolled back for two reasons:

                        1) In the database, I found the data was not changed.

                        2) In the JBossTS source code, I found this:

                         

                        if (!transactionLog((Xid) xids[j]))
                            xares.rollback((Xid) xids[j]);
                        else
                        {
                            /*
                             * Ignore it as the transaction system
                             * will recovery it eventually.
                             */
                        }
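One way to read the quoted branch is as a presumed-abort rule: an in-doubt xid reported by the resource manager is rolled back only when the transaction manager holds no log record for it, and anything that does have a log record is left for log-driven recovery. The sketch below illustrates just that decision with plain strings standing in for xids. `OrphanScan` and its method are hypothetical names, and this is not the JBossTS implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the "no log record, so roll back" decision. */
final class OrphanScan {

    /**
     * Returns the in-doubt xids that would be rolled back: those the resource
     * manager reports but for which the transaction manager has no log record.
     */
    static List<String> toRollBack(List<String> inDoubtAtRm, Set<String> loggedByTm) {
        List<String> orphans = new ArrayList<>();
        for (String xid : inDoubtAtRm) {
            if (!loggedByTm.contains(xid)) {
                orphans.add(xid); // no log record: presume abort
            }
            // Otherwise leave it alone; recovery driven by the log is expected to finish it.
        }
        return orphans;
    }
}
```

Under this reading, a transaction that reached the commit phase and still has a log record would not be rolled back by this branch.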
                        • 69. Re: Jboss transaction recovery issue
                          marklittle
                          Two things: first I didn't realize you were using the JTS implementation. Try with the JTA for a start: we're looking for the error code that the XAResource returns during commit; although it'll be available in the JTS I want to make sure we're not fighting an incorrectly configured ORB or something else. Second, we've already covered the code snippet you included several pages back.
                          • 70. Re: Jboss transaction recovery issue
                            scarceller

                            mmusgrov wrote:

                             

                            I don't know why the sar fails to deploy - the unpack depends on the existence of the JBoss AS tmp directory (I'll try it out on a fresh EAP release), but the workaround you applied was good.

                             

                            And it's odd that the tool isn't displaying the various intentions lists (these lists determine what the TM intends to do with the various participants of the transaction). Would it be possible for you to attach a copy of the log store, which should be located in <server>data/tx-object-store?

                            Post #32 has example ObjectStore directories with 3 in-doubts.

                             

                            I don't think this is an ObjectStore issue; I think some class is missing (cannot be found) when I try to drill into the ObjectStore record details. I'll try to get you the exception information from the log. If I recall correctly, the exception says something about a BrowserFrame exception.

                             

                            Thanks.

                            • 71. Re: Jboss transaction recovery issue
                              marklittle
                              I knocked up a command-line browser over the weekend. Once I get a chance to tidy it up I will commit it to trunk and you can try to build it independently for the version of TS you're using.
                              • 72. Re: Jboss transaction recovery issue
                                scarceller

                                mark.little@jboss.com wrote:

                                 

                                OK let's start this by working under the assumption that the scenario Andrew outlined is the one that matches your case. If it isn't or, say, the XAResource is returning XA_RETRY, then we either need to clarify the scenario or we have a problem elsewhere.

                                 

                                Next, let's not use the "million flies" argument. Yes it may well be the case that other application servers you've tested "get this right", but that doesn't mean it's the right thing for them to do. However, neither does it mean that there is no issue in our code, which is why we're trying to get to the bottom of this. Having information about how others behave is a good data point, but it doesn't necessarily indicate a solution.

                                 

                                With that said, let's agree that heuristic outcomes are bad. They break the A in ACID and cannot be resolved automatically. The fact that something could try to do so is actually a bad thing: a heuristic outcome means that the RM did something that went against the true outcome of the transaction. It could have happened just now, or it could have happened hours ago. In either case it's possible that some other application could then have made decisions based on the data that this RM did (or didn't) commit, when in fact that data is erroneous. In which case we're potentially in a cascading rollback scenario, where in order to resolve the issue we (or something) have to chase down numerous applications and correct their data as well. So a coordinator opaquely hiding heuristic decisions is not doing you, your applications or your data any favours.

                                 

                                What you'll find throughout the JBossTS codebase is that we try very hard to fail safe and avoid heuristics if at all possible. So for instance, if the first resource we tell to commit throws a heuristic rollback then we'll move the transaction into a rollback phase and try to roll back all of the other participants so that the outcome of them all is rollback, thus avoiding the heuristic outcome. But in some cases it's not possible, and where there's any doubt as to the outcome of the transaction we try to give as much information as possible and let the administrator take over where it makes sense (hopefully you'll see that resolving heuristics really needs an understanding of the semantics of the application).

                                 

                                So if we look at what we do when a crash occurs during the commit call on XAResource, the above may make sense. I'll go through each of the error codes that can legally be placed within the XAException and explain why we do what we do. Then you can maybe let me know what the other implementations are doing differently.

                                 

                                • XAException.XA_HEURHAZ: well this one is easy ;-)
                                • XAException.XA_HEURCOM: ditto.
                                • XAException.XA_HEURRB, XAException.XA_RB*: the resource has rolled back, so we're in trouble.
                                • XAException.XAER_RMERR and XAException.XAER_PROTO: the XA specification is pretty clear on these and that we have to consider them to have rolled back.
                                • XAException.XA_HEURMIX: another easy one.
                                • XAException.XAER_NOTA: if we get this during recovery then we assume a previous call to commit worked and ignore. Otherwise the RM is saying that it doesn't know about a transaction we know about so there's a discrepancy there and we fail safe and assume a hazard.
                                • XAException.XA_RETRY: here we could try again immediately, but instead we rely on recovery to kick in periodically when it checks the log. If your RM is returning this and we are not replaying the transaction on this RM then there is an issue.
                                • XAException.XAER_INVAL and XAException.XAER_RMFAIL: there's another potential discrepancy here and no guarantee that retrying will do any good. Therefore we assume a hazard and let the admin tidy this up. Now there's an argument to be had that in some situations it may be OK to retry on an XAER_RMFAIL, but if that's the case then XA_RETRY really could have been thrown in the first place.

                                 

                                And that's it. No other error codes are permitted by the standard.
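To make the list above easier to scan, here is a rough sketch that maps each error code to the outcome described for it. It uses only the standard `javax.transaction.xa.XAException` constants. The `CommitErrorClassifier` class and the `Outcome` names are hypothetical, treating XA_HEURCOM as "committed" and unknown codes as a hazard are my assumptions, and none of this is taken from the JBossTS source.

```java
import javax.transaction.xa.XAException;

/** Hypothetical summary of the commit-time error handling described in the list above. */
final class CommitErrorClassifier {

    enum Outcome { COMMITTED, HEURISTIC_HAZARD, HEURISTIC_ROLLBACK, HEURISTIC_MIXED, RETRY_VIA_RECOVERY }

    /**
     * @param errorCode      the errorCode field of the XAException thrown by commit()
     * @param duringRecovery true when commit() was called by the recovery pass
     */
    static Outcome classify(int errorCode, boolean duringRecovery) {
        // XA_RB* codes occupy the range XA_RBBASE..XA_RBEND.
        if (errorCode >= XAException.XA_RBBASE && errorCode <= XAException.XA_RBEND) {
            return Outcome.HEURISTIC_ROLLBACK;
        }
        switch (errorCode) {
            case XAException.XA_HEURHAZ:
                return Outcome.HEURISTIC_HAZARD;
            case XAException.XA_HEURCOM:
                // Assumption: a heuristic commit agrees with the commit decision.
                return Outcome.COMMITTED;
            case XAException.XA_HEURRB:
            case XAException.XAER_RMERR:
            case XAException.XAER_PROTO:
                // The resource has, or must be considered to have, rolled back.
                return Outcome.HEURISTIC_ROLLBACK;
            case XAException.XA_HEURMIX:
                return Outcome.HEURISTIC_MIXED;
            case XAException.XAER_NOTA:
                // During recovery, assume an earlier commit call already worked.
                return duringRecovery ? Outcome.COMMITTED : Outcome.HEURISTIC_HAZARD;
            case XAException.XA_RETRY:
                // Not retried immediately; left for periodic recovery.
                return Outcome.RETRY_VIA_RECOVERY;
            case XAException.XAER_INVAL:
            case XAException.XAER_RMFAIL:
                return Outcome.HEURISTIC_HAZARD;
            default:
                // Assumption: fail safe on anything unexpected.
                return Outcome.HEURISTIC_HAZARD;
        }
    }
}
```

For example, `classify(XAException.XAER_RMFAIL, false)` lands in the hazard bucket, while `classify(XAException.XA_RETRY, false)` is left for recovery to replay.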

                                 

                                Therefore this comes down to what error code your RM is returning and under what situation. If you can let us know then maybe we can suggest a solution. Then again, as I said at the start, maybe this isn't your scenario and we're looking at the wrong area of the code.

                                 

                                Mark,

                                 

                                I ran 2 tests to capture the exception I get from XAResourceRecord.commit() when the DB is disconnected just prior to the commit():

                                 

                                First test:

                                Add to DB2 first then Oracle

                                Pull network on the Oracle DB prior to commit()

                                Oracle RM returns this exception:

                                     ------------------------

                                     2010-01-25 14:02:19,718 WARN  [com.arjuna.ats.jta.logging.loggerI18N] (http-0.0.0.0-8080-7)      [com.arjuna.ats.internal.jta.resources.arjunacore.commitxaerror] [com.arjuna.ats.internal.jta.resources.arjunacore.commitxaerror]      XAResourceRecord.commit - xa error XAException.XAER_RMFAIL

                                     ------------------------
                                     Oracle throws XAER_RMFAIL when network gets disconnected

                                And you said an XAER_RMFAIL results in this xid being marked as a hazard. This means the TM won't tidy things up: the in-doubt simply remains in the Oracle DB forever, as the TM won't fix it.

                                 

                                Second test: this one is interesting

                                Add to Oracle first then DB2

                                Pull network on the DB2 DB prior to commit()

                                DB2 RM returns this exception:

                                     ---------------------------

                                     2010-01-25 14:32:51,625 WARN  [com.arjuna.ats.jta.logging.loggerI18N] (http-0.0.0.0-8080-39)      [com.arjuna.ats.internal.jta.resources.arjunacore.commitxaerror] [com.arjuna.ats.internal.jta.resources.arjunacore.commitxaerror]      XAResourceRecord.commit - xa error XAException.XA_RETRY

                                     ---------------------------
                                     DB2 throws XA_RETRY when network gets disconnected

                                And you said an XA_RETRY leaves the xid in a retry state so the TM can resolve it.

                                However, the in-doubt remains in DB2 and the TM never commits it.

                                 

                                In summary, I have both types of exception (Oracle=XAER_RMFAIL and DB2=XA_RETRY), but neither case results in a resolved in-doubt, not even DB2.

                                 

                                Thoughts?

                                • 73. Re: Jboss transaction recovery issue
                                  marklittle
                                  Sal, I think the first thing you need to do is see if the log browser Michael mentioned will run, then you can take a look at the logs and confirm their internal status. Certainly I'd expect recovery to retry in the second case. I'll look at the code later today as well.
                                  • 74. Re: Jboss transaction recovery issue
                                    scarceller

                                    mark.little@jboss.com wrote:

                                     

                                    Sal, I think the first thing you need to do is see if the log browser Michael mentioned will run, then you can take a look at the logs and confirm their internal status. Certainly I'd expect recovery to retry in the second case. I'll look at the code later today as well.

                                    Mark,

                                     

                                    First, the ObjectStore browser comes up, but I can't drill all the way down into the xid that needs help. The browser throws a class-missing exception when I try to drill into the xid. I can see all the xids that need help but can't drill into them to see their status.

                                     

                                    Then, the Oracle case is fully understood: it throws an XAER_RMFAIL, and once that happens the TM sets the xid aside, marked as Hazard.

                                     

                                    The DB2 case is very interesting: when the commit() fails because the network has gone down, it gets an XA_RETRY and the TM does retry it. But once I reconnect DB2 to the network, the very next scan by the TM retries the commit() and gets an XAER_RMFAIL on this retry. This makes no sense. I also looked at DB2, and it doesn't seem that the request even reached DB2. Is it possible that the RM throws an XAER_RMFAIL on its own without the call ever reaching DB2? Keep in mind that I never shut down the AppServer, which means the JDBC connection pool can have (and most likely has) stale connections because of the network outage. The bottom line is that on the reconnect-and-retry pass the RM throws XAER_RMFAIL, and once this happens we are back to setting the XID aside, marked as Hazard. I'm now trying to watch very closely to see what's going on during the retry of the commit to DB2. Any thoughts on how to figure out exactly why the XAER_RMFAIL is thrown, and by whom?

                                     

                                    I attached a log from the DB2 case; the log is complete from server startup to shutdown. I also put lots of comments in the log (search for "<- Sal,"). This run caused 2 in-doubts in DB2 that need to be committed. If you also look for "] XAResourceRecord.commit" you will find 4 of these exceptions: the first 2 are the XA_RETRY, then the next 2 are XAER_RMFAIL during recovery. The attached zip has the log as well as the ObjectStore as it was at the end of the run, after the AS was stopped gracefully.
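For anyone who wants to pull these entries out of a large server log without searching by hand, here is a small sketch that extracts the timestamp and XA error code from lines shaped like the two WARN entries quoted earlier in this post. The regular expression is inferred from those two lines only, so treat the format as an assumption, and `CommitErrorLogScan` is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds "XAResourceRecord.commit - xa error" entries in a server log. */
final class CommitErrorLogScan {

    // Group 1: timestamp at the start of the line. Group 2: the XAException constant name.
    private static final Pattern COMMIT_ERROR = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}).*"
            + "XAResourceRecord\\.commit - xa error XAException\\.(\\w+)");

    /** Returns one "timestamp CODE" string per matching line, in log order. */
    static List<String> scan(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = COMMIT_ERROR.matcher(line.trim());
            if (m.find()) {
                hits.add(m.group(1) + " " + m.group(2));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: CommitErrorLogScan <server.log>");
            return;
        }
        for (String hit : scan(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(hit);
        }
    }
}
```

Listing the hits in order makes the sequence described above (XA_RETRY first, XAER_RMFAIL on the later recovery pass) easy to confirm against the timestamps.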

                                     

                                    Thanks.