13 Replies Latest reply on May 21, 2010 2:22 AM by aloubyansky

    first stax performance test

    aloubyansky

      I've written a simple StAX-based parser for XB to see how it affects the performance. I've used the same tests I've been using previously while working on optimizations and comparison to JAXB.

       

      I'd like to mention that XB testsuite with StAX parser has a few failures related to required attributes with default values and tests related to XInclude. In JBoss AS both of these are important.

      As to the default attributes, if they are missing from the XML, the parser is supposed to silently add them in. But this happens (with SAX at least) only if validation is enabled. The StAX impl I've used for testing (the one included in JDK6) doesn't seem to support validation (setting the isValidating property to true fails).
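
      A minimal sketch of how the validation support could be probed through the standard StAX API. The class and method names (StaxValidationProbe, tryEnableValidation) are made up for illustration and are not part of XB; only XMLInputFactory and its IS_VALIDATING property come from javax.xml.stream.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper, not part of XB: probes whether a StAX factory accepts validation.
class StaxValidationProbe {

    /** Returns true only if the factory accepted IS_VALIDATING=true. */
    static boolean tryEnableValidation(XMLInputFactory factory) {
        if (!factory.isPropertySupported(XMLInputFactory.IS_VALIDATING)) {
            return false;
        }
        try {
            factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.TRUE);
            return true;
        } catch (IllegalArgumentException e) {
            // setProperty is specified to throw IllegalArgumentException
            // when a property (or its value) is not supported.
            return false;
        }
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        System.out.println(factory.getClass().getName()
                + " accepts validation: " + tryEnableValidation(factory));
    }
}
```

      Which result you get depends on the StAX implementation found on the classpath.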

      XInclude is not supported. But if necessary, I guess, we could implement it on top.

       

      The results show the average time (in ms) of 10 independent re-runs and include the time for the first test run (includes parser initialization, classloading, etc) and subsequent 1000 runs.
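
      For reference, the "first run plus next N runs" measurement can be sketched roughly as below. This is not the actual test harness; the class name FirstVsSteadyTimer and the Callable-based shape are assumptions for illustration. To reproduce independent re-runs, each measurement would have to start in a fresh JVM, otherwise the first run no longer pays for classloading and parser initialization.

```java
import java.util.concurrent.Callable;

// Hypothetical timing helper, not the harness used for the numbers in this thread.
class FirstVsSteadyTimer {

    /**
     * Runs the task once, then 'runs' more times.
     * Returns { first-run millis, total millis of the following runs }.
     */
    static double[] time(Callable<?> task, int runs) throws Exception {
        long t0 = System.nanoTime();
        task.call();                       // includes one-off initialization costs
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.call();                   // already-initialized runs
        }
        long t2 = System.nanoTime();
        return new double[] { (t1 - t0) / 1e6, (t2 - t1) / 1e6 };
    }

    public static void main(String[] args) throws Exception {
        double[] ms = time(() -> Integer.valueOf(42).toString(), 1000);
        System.out.println("first run: " + ms[0] + " ms, next 1000 runs: " + ms[1] + " ms");
    }
}
```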

       

      XB with SAX (custom written handlers, XInclude disabled, property replacement disabled, non-validating)

      first run: 127.1

      next 1000 runs: 892.4

       

      XB with SAX (default handlers, XInclude disabled, property replacement disabled, non-validating)

      first run: 127.8

      next 1000 runs: 997

       

      JAXB (XInclude disabled, non-validating)

      first run: 119.8

      next 1000 runs: 1088.4

       

      XB with StAX (custom written handlers, property replacement disabled)

      first run: 22.6

      next 1000 runs: 1015.3

       

      XB with StAX (default reflection-based handlers, property replacement disabled)

      first run: 24.8

      next 1000 runs: 1102.1

       

      The initialization with StAX is unbeatable. But the actual parsing and unmarshalling doesn't look that good. Just to be clear, all I did was switch from SAX events to StAX. The parser is very simple (create a stream and pull events), and I don't see any potentially significant optimizations there.
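
      To illustrate what "create a stream and pull events" looks like with the standard API, here is a minimal sketch. It is not the XB parser; the class name PullLoopSketch and the string trace are made up for illustration. In a real binding layer each case would call the equivalent of the old SAX callbacks (startElement, characters, endElement).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch of a StAX pull loop; records events instead of unmarshalling.
class PullLoopSketch {

    static List<String> trace(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        events.add("start:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // skip whitespace-only text between elements
                        if (!reader.isWhiteSpace()) {
                            events.add("text:" + reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("end:" + reader.getLocalName());
                        break;
                    default:
                        break; // comments, PIs, DTD etc. are ignored in this sketch
                }
            }
        } finally {
            reader.close();
        }
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(trace("<a><b>x</b></a>"));
    }
}
```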

        • 1. Re: first stax performance test
          aloubyansky

          BTW, within the AS, in case of the SAX-based parser, the SAX parser factory will be created and initialized just once. All the XML files except the first one will benefit from the initialization during the first XML parsing.

          • 2. Re: first stax performance test
            jason.greene

            Hi Alexey,

             

            Which StAX impl are you using? In the past I found woodstox to be much faster than what is bundled in the JDK.

            • 3. Re: first stax performance test
              aloubyansky

              It was the one from Sun's jdk6 on windows. I'll try woodstox.

              • 4. Re: first stax performance test
                aloubyansky

                I've tested the latest Woodstox 4.0.8. I also found a simply stupid bug in my previous test: XMLInputFactory was an instance variable instead of being a class variable. Sorry, that makes the previous tests not fair wrt StAX.
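
                The difference can be sketched like this. It is a simplified illustration, not the XB code: the class SharedFactoryParser, its rootName method and the factoryCreations counter are hypothetical. With the factory held in a static field, the provider lookup and configuration happen once per class instead of once per parser instance.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical parser wrapper illustrating a class-level (static) XMLInputFactory.
class SharedFactoryParser {

    // Counts factory creations so the sharing is observable in this sketch.
    static int factoryCreations = 0;

    // Class variable: created once, reused by every parser instance.
    // Whether sharing across threads is safe depends on the StAX implementation.
    private static final XMLInputFactory FACTORY = createFactory();

    private static XMLInputFactory createFactory() {
        factoryCreations++;
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        return factory;
    }

    /** Returns the local name of the document's root element. */
    String rootName(String xml) throws XMLStreamException {
        XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // advance to the root START_ELEMENT
            return reader.getLocalName();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(new SharedFactoryParser().rootName("<a/>"));
        System.out.println(new SharedFactoryParser().rootName("<b/>"));
        System.out.println("factories created: " + factoryCreations);
    }
}
```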

                I've re-run the tests only for XB just switching the parsers: SAX, JDK StAX, Woodstox.

                For all the parsers validation was disabled, namespace awareness enabled, XInclude disabled.

                XB features: property replacement disabled, default reflection-based handlers.

                 


                                          JDK6 SAX   JDK6 StAX   Woodstox
                first run (in ms)            119.9        21.8       79.5
                next 1000 runs (in ms)       939.2       546        463.3

                 

                Woodstox takes longer to initialize but the total time (first plus subsequent 1000 runs) is still slightly better than the JDK's StAX (542.8 vs 567.8).

                 

                WRT the XB testsuite, Woodstox showed the same failures as the JDK6 StAX, i.e. no support for default attribute values (same reason, i.e. schema validation is not supported) and XInclude.

                Woodstox does support validation but only for DTD, which is not that interesting now.

                 

                It does make sense to look into switching to a StAX impl, which means the limitations above will have to be addressed. Default attribute values could be specified with Java binding annotations or initialized manually. But XInclude still has to be implemented. I'm gonna look into that.

                 

                Another thing, entity resolution (which is in JBossEntityResolver) will have to be adapted as well. StAX API, of course, doesn't use SAX's EntityResolver and InputSource.
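
                One possible shape for that adaptation is a thin XMLResolver that delegates to an existing SAX EntityResolver. This is only a sketch under assumptions: the class name SaxResolverAdapter is hypothetical, it is not how JBossEntityResolver is actually wired, and it only handles InputSources that carry a byte stream.

```java
import java.io.IOException;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical adapter: exposes a SAX EntityResolver through the StAX XMLResolver interface.
class SaxResolverAdapter implements XMLResolver {

    private final EntityResolver delegate;

    SaxResolverAdapter(EntityResolver delegate) {
        this.delegate = delegate;
    }

    @Override
    public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace)
            throws XMLStreamException {
        try {
            InputSource source = delegate.resolveEntity(publicID, systemID);
            if (source == null) {
                return null; // null lets the parser fall back to its default resolution
            }
            // XMLResolver may return an InputStream; character streams are not handled here.
            return source.getByteStream();
        } catch (SAXException | IOException e) {
            throw new XMLStreamException("Failed to resolve entity " + systemID, e);
        }
    }
}
```

                Such an adapter would be registered with XMLInputFactory.setXMLResolver(...) before creating readers.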

                • 5. Re: first stax performance test
                  jason.greene

                  Wow, they have really improved StAX JDK performance. It might not be worth using woodstox after all.

                  • 6. Re: first stax performance test
                    aloubyansky

                    Actually, I had to continue with Woodstox. JDK's impl doesn't give me full DTD info. It does report the DTD event but I couldn't get publicId/systemId. But using Woodstox's API I could get to it. It's really necessary for the metadata and deployers. I got all the metadata tests passing and the AS booting.

                    We actually don't have any XML at the moment in the AS with XInclude. But I am sure there was something in MC using XInclude.

                    • 7. Re: first stax performance test
                      dmlloyd

                      Alexey Loubyansky wrote:

                       

                      Actually, I had to continue with Woodstox. JDK's impl doesn't give me full DTD info. It does report the DTD event but I couldn't get publicId/systemId. But using Woodstox's API I could get to it. It's really necessary for the metadata and deployers. I got all the metadata tests passing and the AS booting.

                      We actually don't have any XML at the moment in the AS with XInclude. But I am sure there was something in MC using XInclude.

                       

                      Wow, I'm surprised at this.  Are you using the stream reader or event reader interface?  Stream reader might be better (and, it might give access to the publicId/systemId stuff by way of getPIData or something like that).

                      • 8. Re: first stax performance test
                        aloubyansky

                        I originally used stream readers, and the test results above are for stream readers. But then I needed to get the DTD publicId/systemId, and with stream readers I couldn't. I haven't tried getPIData(), as I assumed it's for processing instructions. In the standard StAX API there is the general getText(); for the DTD event the JDK impl returns something like "couldn't get the DTD info" and Woodstox returns just an empty string.

                        The only way I've found to get publicId/systemId so far is by using event readers and Woodstox-specific API.

                         

                        I haven't run the performance comparison tests for streams vs events yet as I wanted to make it work for the metadata and the AS first.

                        • 9. Re: first stax performance test
                          aloubyansky

                          Here are the results of Woodstox event readers vs stream readers (running the same tests as above against XB)

                           

                           


                                                    event readers   stream readers
                          first run (in ms)                 101.6             85.8
                          next 1000 runs (in ms)            541.2            464.8

                           

                          And here is the average of 10 AS 6 trunk start-ups comparing Woodstox event readers against SAX:

                          Woodstox: 18223.2 ms

                          SAX: 19038.3 ms

                           

                          The results of the AS start-ups were very inconsistent. I actually booted the AS many more times than that, and sometimes SAX would boot in 17 sec + something and Woodstox in 22 sec + something. But still, on average the difference is probably close to the one above.

                          When I was testing the start-up I was offline, with no anti-virus or other heavy processes running. It's on Windows Vista.

                          • 10. Re: first stax performance test
                            dmlloyd

                            Alexey Loubyansky wrote:

                             

                            The only way I've found to get publicId/systemId so far is by using event readers and Woodstox-specific API.

                            You might be able to set an XMLResolver on the XMLInputFactory and use that to detect the publicId/systemId?
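
                            A rough sketch of that idea: an XMLResolver that records the identifiers it is asked for and then hands off to whatever resolver is already in place. The class name RecordingResolver is hypothetical. Whether (and with which identifiers) a given StAX implementation calls the resolver for the DOCTYPE's external subset is implementation-dependent, so this is an assumption to verify, not a guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;

// Hypothetical resolver wrapper: remembers every publicId/systemId pair it sees.
class RecordingResolver implements XMLResolver {

    /** Each entry is { publicId, systemId } in the order requested. */
    final List<String[]> requests = new ArrayList<>();

    private final XMLResolver delegate; // may be null: then the parser's default resolution applies

    RecordingResolver(XMLResolver delegate) {
        this.delegate = delegate;
    }

    @Override
    public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace)
            throws XMLStreamException {
        requests.add(new String[] { publicID, systemID });
        return delegate == null
                ? null
                : delegate.resolveEntity(publicID, systemID, baseURI, namespace);
    }
}
```

                            It would be installed with XMLInputFactory.setXMLResolver(new RecordingResolver(existingResolver)) and inspected after parsing.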

                            • 11. Re: first stax performance test
                              aloubyansky

                              I actually do set an XMLResolver. But it's a kind of a separate component, which actually wraps current JBossEntityResolver with all the registered schemas/dtds.

                              But, yes, it might be a good idea. I'll look into it.

                              • 12. Re: first stax performance test
                                dmlloyd

                                Alexey Loubyansky wrote:

                                 

                                I actually do set an XMLResolver. But it's a kind of a separate component, which actually wraps current JBossEntityResolver with all the registered schemas/dtds.

                                But, yes, it might be a good idea. I'll look into it.

                                 

                                It's probably OK to use the woodstox API too, I was just surprised that such a (seemingly basic) thing is so hard to do with a basic StAX spec implementation.

                                • 13. Re: first stax performance test
                                  aloubyansky

                                  The API is ok, i.e. it provides the methods to get the information but the impl doesn't.