8 Replies Latest reply on Nov 21, 2011 8:14 AM by rhauch

    Next Generation ModeShape

    rhauch

      In March 2008 we announced a new project called 'JBoss DNA' whose purpose was to be a new JCR repository implementation, and one that offered new ways of storing and federating content. Since repositories often store files and metadata, JBoss DNA would also automatically examine those files, extract useful information, and store that derived information back in the repository. The project would also leverage other JBoss.org technologies, including Hibernate, JBoss Cache, JGroups, and the JBoss Application Server. We released seven minor releases of JBoss DNA.

       

      In March 2010, the project was rebranded and the 1.0 release was issued, with support for all of the required and many of the optional parts of the JCR 1.0 API; this was followed by a few minor releases. Support for JCR 2.0 (JSR-283) came a few months later with the release of ModeShape 2.0 in July 2010. This was a pretty straightforward upgrade for our users, since JCR 2.0 was largely an expansion of functionality (with only a handful of methods that were deprecated without some form of replacement). And during the next 15 months, we continued releasing new minor versions with bug fixes, new features, and performance improvements. ModeShape is more stable and faster than ever.

       

      But we want to and can do more. We want ModeShape to be the fastest, most scalable, and most available JCR implementation there is. We want ModeShape to support large numbers of concurrent clients. We want ModeShape to support flatter hierarchies as well and as fast as deeper ones. We want ModeShape to be much easier to configure, deploy, monitor, and manage. We want seamless integration with JBoss AS7 to make it trivial to use JCR within your web apps. We want to support XA transactions and to participate in distributed transactions. We want to open the door to other kinds of persistent storage, including those that are eventually consistent. We want ModeShape to scale to massively large repositories and even very large clusters.

       

      These goals are not small steps forward. No, they're major leaps beyond where we are now. We'll obviously keep the standard JCR API as our public API, so that will remain consistent for client applications. But we're actually at a perfect place to start considering some significant changes to the way ModeShape works under the covers. With ModeShape 3.0, we have the opportunity to change and improve how ModeShape stores its content, how the sequencers work, how and where our indexes are stored, and how content is cached. We can continue to provide fixes on the 2.x branch while we work on and stabilize the 3.x branch. And no matter what, we must commit to making it very easy to migrate a 2.x installation to a 3.0 installation.

       

      I've spent four to five months considering ways we can achieve these goals. More incremental approaches reduce the effort and risk, but make it more difficult to achieve the goals. However, one approach was so promising that I thought it warranted a full-blown prototype to test it out, and the more I implemented, the more promising it looked. Recently I completed enough of it that I could run some performance comparisons using the JCR API. I was floored by the results, even after spending no time optimizing the system:

       

      • Time to create 10K child nodes, perform a save, and persist to disk: 0.8 seconds
      • Time to create a subgraph with 1.01M nodes, periodically saving, and persisting to disk: 3.5 seconds
      • Time to get a node by path from a workspace with 1M nodes: 0.00058 seconds

       

      Running the same operations in ModeShape 2.x takes significantly longer (in some cases multiple orders of magnitude!), so I think this new design shows tremendous promise.

       

      How does the new approach work? It takes ModeShape's existing JCR implementation and puts it on top of a simple framework that stores each node as a JSON/BSON document inside Infinispan and uses Infinispan's cache loaders for persistence. It uses Infinispan both as a distributed store and a distributed cache, so accessing nodes is quite efficient, and reading from persistent storage only happens if the cache has previously purged that value from memory. Infinispan also supports XA transactions, making it far easier for ModeShape to support them. Infinispan is a data grid that has three modes (local, replicated, and distributed) that dictate how and whether information is copied across the data grid, making it possible to use Infinispan as a massive, distributed, and available in-memory heap where all content can be kept in memory. By using Infinispan this way, a cluster of ModeShape processes effectively becomes a "content grid".
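To make the idea concrete, here is a minimal sketch of "one document per node, kept in memory, with persistent storage consulted only on a miss". It is not ModeShape or Infinispan code: plain Java maps stand in for the data grid and for the store behind a cache loader, and every class, method, and field name (`NodeDocumentStoreSketch`, `nodeDocument`, `"children"`, and so on) is invented purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: none of these names are ModeShape or Infinispan APIs.
final class NodeDocumentStoreSketch {
    // Stand-in for the in-memory data grid (Infinispan in the design described above).
    private final Map<String, Map<String, Object>> memory = new ConcurrentHashMap<>();
    // Stand-in for the persistent storage that sits behind a cache loader.
    private final Map<String, Map<String, Object>> persistent = new ConcurrentHashMap<>();
    // Counts how often persistent storage had to be read (not thread-safe; demo only).
    private int loadsFromPersistentStore = 0;

    /** Write-through: keep the document in memory and also persist it. */
    void put(String nodeKey, Map<String, Object> document) {
        memory.put(nodeKey, document);
        persistent.put(nodeKey, document);
    }

    /** Simulates the cache purging an entry from memory. */
    void evict(String nodeKey) {
        memory.remove(nodeKey);
    }

    /** Read-through: persistent storage is consulted only on a memory miss. */
    Map<String, Object> get(String nodeKey) {
        Map<String, Object> doc = memory.get(nodeKey);
        if (doc == null) {
            doc = persistent.get(nodeKey);
            if (doc != null) {
                loadsFromPersistentStore++;
                memory.put(nodeKey, doc); // keep it in memory for later reads
            }
        }
        return doc;
    }

    int loadsFromPersistentStore() {
        return loadsFromPersistentStore;
    }

    /** Builds a JSON-like document for one node; the field names are made up. */
    static Map<String, Object> nodeDocument(String name, String parentKey,
                                            List<String> childKeys,
                                            Map<String, Object> properties) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", name);
        doc.put("parent", parentKey);
        doc.put("children", childKeys);
        doc.put("properties", properties);
        return doc;
    }

    public static void main(String[] args) {
        NodeDocumentStoreSketch store = new NodeDocumentStoreSketch();
        store.put("key-1", nodeDocument("report", "key-0", List.of(),
                Map.of("jcr:primaryType", "nt:unstructured")));
        store.get("key-1");   // served from memory
        store.evict("key-1"); // the cache purges the entry
        store.get("key-1");   // this read falls back to persistent storage
        System.out.println("persistent loads: " + store.loadsFromPersistentStore());
    }
}
```

In this sketch only the read after the eviction touches the persistent map, which mirrors the behavior described above: a node is read from persistent storage only when it is no longer held in memory.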

       

      What does this mean for ModeShape's JCR implementation layer? Most importantly, we're not starting from scratch. The JCR implementation classes in the 2.x codebase have a lot of logic in them to properly implement the JCR specification, but they use the internal graph API and cache mechanism to do this. We'll keep all that logic but will refactor the classes to use the new document-oriented approach on top of Infinispan, and this means that the overall risk of this major refactoring is significantly lower than a clean-sheet rewrite.

       

      What does this mean for ModeShape's connectors? ModeShape's Infinispan and JBoss Cache connectors are no longer needed with this new architecture, simply because Infinispan plays such an important role in the new approach. Infinispan already has a number of cache loaders that can persist cached content (in a number of different ways), so some of our connectors can effectively be replaced with existing Infinispan cache loaders. For example, the JDBC cache loaders would work well in place of our JPA connector. Other connectors might not have a direct analog, but may actually be replaced with a cache loader that has more capability. For example, Infinispan has a file system cache loader that's not transactional, and so could be used in place of our disk-based storage connector. But the BerkeleyDB cache loader may actually be a far better and faster replacement that supports transactions. Plus, there are other cache loaders that offer functionality we didn't have before, including JClouds and Cassandra. The federation connector will not be needed, due to built-in support for federation within the new architecture. We can develop ModeShape-specific cache loaders to support the functionality provided by the remaining connectors (e.g., SVN, file system, JDBC metadata). If needed, we can even provide an Infinispan cache loader that works with the database schema used by the 2.x JPA connector.

       

      What does this mean for ModeShape's sequencers? We'll likely change the sequencers to directly use the JCR API, making it much easier for developers already familiar with JCR to write new custom sequencers. We'll also deprecate the existing sequencer API and provide an adapter that can continue to use sequencers written against the older API, giving people a few releases to convert their existing custom sequencers.
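As a rough illustration of the adapter idea (and nothing more), the sketch below wraps an "old-style" sequencer so that it can be called through a "new-style" interface. All of the types here are hypothetical stand-ins: `LegacySequencer`, `LegacyOutput`, `NewSequencer`, and `OutputNode` are not the ModeShape 2.x or 3.x sequencer interfaces, and `OutputNode` is only a tiny substitute for a real `javax.jcr.Node` so that the example has no external dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: these interfaces are NOT the real ModeShape sequencer APIs.
final class SequencerAdapterSketch {

    /** Stand-in for an older-style sequencer that reports output as path/property pairs. */
    interface LegacySequencer {
        void sequence(byte[] content, LegacyOutput output);
    }

    /** Stand-in for the older-style output callback. */
    interface LegacyOutput {
        void setProperty(String nodePath, String propertyName, Object value);
    }

    /** Stand-in for a newer-style sequencer that writes directly into an output node. */
    interface NewSequencer {
        void execute(byte[] content, OutputNode outputNode);
    }

    /** Tiny substitute for a JCR node: named children plus properties. */
    static final class OutputNode {
        final Map<String, Object> properties = new LinkedHashMap<>();
        final Map<String, OutputNode> children = new LinkedHashMap<>();

        /** Returns the named child, creating it if it does not exist yet. */
        OutputNode child(String name) {
            return children.computeIfAbsent(name, n -> new OutputNode());
        }
    }

    /** Adapts a legacy sequencer so it can be invoked through the new interface. */
    static NewSequencer adapt(LegacySequencer legacy) {
        return (content, outputNode) -> legacy.sequence(content, (path, name, value) -> {
            // Walk (and create) the relative path below the output node, then set the property.
            OutputNode target = outputNode;
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    target = target.child(segment);
                }
            }
            target.properties.put(name, value);
        });
    }

    public static void main(String[] args) {
        LegacySequencer legacy = (content, out) -> out.setProperty("meta/info", "length", content.length);
        OutputNode root = new OutputNode();
        adapt(legacy).execute(new byte[] {1, 2, 3}, root);
        System.out.println(root.child("meta").child("info").properties);
    }
}
```

The point of the design is that existing custom sequencers keep working unchanged behind the adapter while new ones are written directly against the newer interface.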

       

      What about my existing ModeShape repositories? As mentioned above, we want to make it very easy for you (and your customers) to migrate from ModeShape 2.x to 3.x, so we'll be providing utilities that will help convert 2.x configuration files to the newer format and will help migrate the content from the existing stores into the new Infinispan data grid.

       

      Where do we go from here? The next steps are to move this prototype into a branch in the ModeShape Git repository, and to complete its features and continue testing it. We'll want to add more performance tests to verify that this is heading in the right direction and to be able to measure and compare the performance relative to ModeShape 2.x and the reference implementation.

       

      What's the schedule? I would like to issue the first 3.0.0.Alpha1 release within a few weeks, and issue a second Alpha release about 2-3 weeks later. We'll switch to Beta releases as soon as the JCR and ModeShape 2.x features are done, and continue those while we iron out the bugs. When we're confident that ModeShape 3 is functioning correctly, passing the TCK tests, performing very well, and passing all of our unit and integration tests, we can then issue Candidate Releases and move quickly towards a final release.

       

      What can you do? As always, we welcome anyone who wants to contribute. If you're primarily interested in testing and using ModeShape 3, please follow along until we start issuing the Alpha, Beta and Candidate releases, and then please start testing those releases and filing JIRA issues for any bugs that you find. If, however, you want to get more involved and help implement ModeShape 3 and/or write tests, please let us know! There's lots of exciting and fulfilling work to do.

       

      Where can I learn more? I've attached a PDF file that contains some rough documentation, written from the perspective of a user trying to use ModeShape 3. It's not really what we'll use for documentation, but it's a start at outlining the design of this new approach and what it means for users.

       

      Finally, stay tuned, because tomorrow I'll push out the branch with the code.

        • 1. Re: Next Generation ModeShape
          kbachl

          Sounds very interesting, just some short questions:

           

          1. Will ModeShape 3.0 still be 100% JCR 1.0 and JCR 2.0 compatible?
            (I've only seen a mention of JCR 2.0 which would be a problem as they stripped out wrappers from JCR 1.0 and this hurts us a bit)
          2. Has the problem regarding Infinispan and programmatically created workspaces been solved?
            (In ModeShape 2.x one can't use Infinispan if one has to create workspaces programmatically, as these don't get distributed by Infinispan the way ModeShape would require)

           

          Best

          • 2. Re: Next Generation ModeShape
            rhauch
            1. Will ModeShape 3.0 still be 100% JCR 1.0 and JCR 2.0 compatible?

            ModeShape 2.x passed the JCR 2.0 TCK in all but a few tests, and some of those were test errors. But we do want ModeShape 3.0 to pass 100% of the TCK for JCR 2.0.

             

            Strictly speaking, I don't think ModeShape will work with the JCR 1.0 API, because it is built against the JCR 2.0 API, which has additional methods in the interfaces. (This is likely true for the reference implementation, too.) However, clients written to JCR 1.0 should continue to work with ModeShape 2 and 3, except in any cases where JCR 2.0 changes the behavior.

             

            2. Has the problem regarding Infinispan and programmatically created workspaces been solved? (In ModeShape 2.x one can't use Infinispan if one has to create workspaces programmatically, as these don't get distributed by Infinispan the way ModeShape would require)

             

            Yes, this has been fixed in Infinispan 5.1. ModeShape 3 will use a single Infinispan cache for all workspaces in a single repository. But because it will be possible to create new repositories in ModeShape, we'll definitely use this feature of Infinispan 5.1.

            • 3. Re: Next Generation ModeShape
              rhauch

              The '3.x' branch is now available in the official Git repository on GitHub, and it's been populated with the start of the work. This branch essentially becomes the 'master' branch for all the remaining 3.0 work.

              • 4. Re: Next Generation ModeShape
                kbachl

                Hello Randall,

                 

                in 1289 you write:

                 

                There are several new Maven modules:

                - modeshape-jcr-redux

                - modeshape-schematic

                 

                The 'modeshape-jcr-redux' module will eventually replace the 'modeshape-jcr' module once

                the implementation is far-enough along. And the 'modeshape-schematic' module will likely

                move into the Infinispan project, so that needs to remain separate.

                 

                Could you please be so kind and give me a hint what exactly jcr-redux will be? - Is it a way to come up with a new JCR layer to replace the current one?

                 

                You also write that modeshape-schematic will go into Infinispan - what exactly is in it that Infinispan will need? (sorry for the maybe dumb question, but I just wonder)

                 

                PS: regarding performance - I don't know how much faster it can be, but using the disk connector and its in-memory cache in 2.6.0.Final, the performance is already superb (ok, it can always get faster)

                • 5. Re: Next Generation ModeShape
                  rhauch

                  in 1289 you write:

                   

                  There are several new Maven modules:

                  - modeshape-jcr-redux

                  - modeshape-schematic

                   

                  The 'modeshape-jcr-redux' module will eventually replace the 'modeshape-jcr' module once

                  the implementation is far-enough along. And the 'modeshape-schematic' module will likely

                  move into the Infinispan project, so that needs to remain separate.

                   

                  Could you please be so kind and give me a hint what exactly jcr-redux will be? - Is it a way to come up with a new JCR layer to replace the current one?

                   

                  "modeshape-jcr-redux" is just a temporary home for the 3.x JCR codebase, as we move functionality from "modeshape-jcr" into "modeshape-jcr-redux". This allows us to keep the "legacy" code in the branch, while still *adding* the new JCR approach. We can migrate the higher-level components (e.g., the web, utils, sequencers, extractors) to use this new "modeshape-jcr-redux" version, and do this one-by-one (incrementally) rather than having to do it all at once. Then when the "modeshape-jcr-redux" is stable enough, we'll remove the "modeshape-jcr", "modeshape-cnd", "modeshape-graph", and "modeshape-repository" modules, and rename "modeshape-jcr-redux" to "modeshape-jcr".

                   

                  Hopefully all this will happen before Alpha1, but it definitely will happen before Beta1.

                   

                  You also write that modeshape-schematic will go into Infinispan - what exactly is in it that Infinispan will need? (sorry for the maybe dumb question, but I just wonder)

                   

                  The -schematic module was originally designed to be an add-on to Infinispan (basically, it's just a JSON/BSON document store with JSON Schema validation) and to live within the Infinispan codebase. The problem with doing this right now is that ModeShape would be at the mercy of the Infinispan release cycle, and the -schematic module is still undergoing a lot of changes. So we thought we'd mature it here, but migrate it to the Infinispan project over time. The timing is still under discussion.

                   

                  Hope that all makes sense.

                  • 6. Re: Next Generation ModeShape
                    rhauch

                    K. Bachl wrote:

                     

                    PS: regarding performance - I don't know how much faster it can be, but using the disk connector and its in-memory cache in 2.6.0.Final, the performance is already superb (ok, it can always get faster)

                     

                    I'm glad you're happy with the performance of 2.6.0.Final! But like you said, there's always room for improvement. We do plan to do much more formal performance testing, and we'll be comparing 3.0 to 2.6.0.Final using configurations that are as close as we can get them.

                     

                    I've run some initial tests that use the JCR API to create a subgraph of 1110 nodes using the following configurations of ModeShape. The test measured how long it took to create the nodes and perform a single Session.save(). The test was run 5 times for each configuration, but only the fastest of the runs is reported. (Update: the minimum times weren't significantly smaller than the average times; the first run usually took the longest.)
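For readers who want to reproduce this style of measurement, here is a minimal sketch of the "run it several times and report the fastest" approach. It is not the actual test code: `RepeatedTimingSketch` and its methods are invented for illustration, and the workload in `main` is only a placeholder for "create 1110 nodes and call Session.save()".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the measurement approach; not the actual ModeShape test code.
final class RepeatedTimingSketch {

    /** Runs the task the given number of times and returns each elapsed time in nanoseconds. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** The fastest (minimum) of the measured runs. */
    static long fastest(long[] elapsed) {
        return Arrays.stream(elapsed).min().getAsLong();
    }

    /** The average of the measured runs, useful as a sanity check against the minimum. */
    static double average(long[] elapsed) {
        return Arrays.stream(elapsed).average().getAsDouble();
    }

    public static void main(String[] args) {
        // Placeholder workload: in the real test this would create the nodes and save the session.
        Runnable workload = () -> {
            Map<Integer, String> nodes = new HashMap<>();
            for (int i = 0; i < 1110; i++) {
                nodes.put(i, "node-" + i);
            }
        };
        long[] elapsed = timeRuns(workload, 5);
        System.out.printf("fastest: %.3f ms, average: %.3f ms%n",
                fastest(elapsed) / 1_000_000.0, average(elapsed) / 1_000_000.0);
    }
}
```

Reporting the minimum filters out warm-up effects such as class loading and JIT compilation, which is consistent with the first run usually being the slowest; comparing it with the average shows whether the minimum is representative.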

                     

                    And please note that these results are preliminary because the 3.0 code is not complete or ready to release, and we still may need to make significant changes. However, it is currently complete enough that creating, reading, modifying and deleting nodes is doing everything that 2.6 did.

                     

                    In-memory

                     

                    First, let's look at running ModeShape configured with all in-memory storage, where nothing is persisted:

                     

                    1) using 2.6.0.Final with the in-memory connector: 357ms

                    2) using 3.0 prototype (using Infinispan with no cache loader): 45ms

                     

                    With these similar configurations, the 3.0 prototype is already almost 8X faster than the 2.6.0.Final code.

                     

                    Persisting to Disk

                     

                    Now, let's compare running ModeShape and persisting to disk. I used the disk-based connector, since it's the fastest connector that stores to disk.

                     

                    1) using 2.6.0.Final with the disk-based connector w/ caching: 2,287ms

                    2) using 3.0 prototype (using Infinispan with the BerkeleyDB cache loader): 111ms

                     

                    With these similar configurations (both persisting to disk), the 3.0 prototype is already roughly 20X faster than the 2.6.0.Final code. Note that the 3.0 configuration is the same as what I used in the tests mentioned in my original post, where it only takes 3.5 seconds to create 1 million nodes (periodically saving the session)!
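The speedup factors quoted in this post are simply the ratio of the 2.6.0.Final time to the 3.0 prototype time. The small sketch below recomputes them from the four timings reported above; the class and method names are made up for the example.

```java
import java.util.Locale;

// Recomputes the speedup factors from the timings reported in this post.
final class SpeedupSketch {

    /** Speedup = baseline time divided by new time (both in the same unit). */
    static double speedup(double baselineMillis, double newMillis) {
        return baselineMillis / newMillis;
    }

    public static void main(String[] args) {
        // In-memory: 357ms (2.6.0.Final) vs 45ms (3.0 prototype)
        System.out.println(String.format(Locale.ROOT, "in-memory: %.1fx", speedup(357, 45)));   // in-memory: 7.9x
        // Persisting to disk: 2,287ms (2.6.0.Final) vs 111ms (3.0 prototype)
        System.out.println(String.format(Locale.ROOT, "on-disk: %.1fx", speedup(2287, 111)));   // on-disk: 20.6x
    }
}
```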

                    • 7. Re: Next Generation ModeShape
                      jonathandfields

                      This sounds like a great direction for 3.0. Consistent with other JBoss projects like AS7, HornetQ and Infinispan - lean, mean, fast.

                       

                      Is the query index (Lucene) going to be stored in Infinispan too? Seems that would have the potential to significantly improve query performance as well....

                       

                      Thanks,

                      Jon

                      • 8. Re: Next Generation ModeShape
                        rhauch

                        Jonathan Fields wrote:

                         

                        [...]

                         

                        Is the query index (Lucene) going to be stored in Infinispan too? Seems that would have the potential to significantly improve query performance as well....

                        Yes, we do want to have ModeShape store Lucene indexes in Infinispan. Hopefully we can make that the new default, although it'll still be possible to store the indexes locally on the file system.