3 Replies Latest reply on May 19, 2014 4:20 PM by rhauch

    Best practices for clustering in ModeShape 4.0

    ma6rl

      I am currently working on an early prototype for an application we will be developing later this year. The application's data model and query needs are a natural fit for JCR, and after evaluating both ModeShape and Jackrabbit we have concluded that ModeShape best meets our needs.

       

      Given the timeframe of the application and the exciting developments happening in ModeShape 4 we are looking to architect our application based on features in ModeShape 4. One of the challenges with this is that ModeShape 4 is still in the early stages of development and a lot of the documentation and existing solutions are focused on the 3.X version.

       

      While I understand that everyone's requirements are different and there is no 'one size fits all' solution, what I am hoping to achieve is to pull together a list of best practices and some example configuration files for deploying ModeShape 4.0 in a cluster using Wildfly 8.0 and the ModeShape subsystem. These best practices will be based on the planned ModeShape 4.0 feature set and not the existing feature set, which means not all of this will be possible with the current alpha releases. I plan to share this information on this forum to help other people looking into this.

       

      Given this I have a number of questions around new features (including those that are still being developed) in 4.0 and how best to take advantage of them.

       

      1. Based on previous discussions (see https://community.jboss.org/message/871616#871616) it looks like a good starting point for managing indexes in a cluster is for each server in the cluster to maintain its own local index (although there are some advantages to having a master index and using JMS to update it, or using an external index provider). Currently it looks like ModeShape 4 is going to initially offer support for local file system and local Lucene indexes. Given this:

       

        a) What are the pros/cons of using the file system vs. Lucene? Is there any documentation that discusses this?

        b) How do you configure indexes using the Wildfly subsystem?

        c) Will it be possible to migrate indexes from local to an external index provider?

       

      2. Based on the 4.0 documentation and the standalone-modeshape-ha.xml shipped with the Wildfly subsystem, it looks like all persistent storage and clustering is now configured via Infinispan, and the ModeShape configuration just references the cache-container and cache to use. Given the complexity and learning curve associated with configuring Infinispan, it would be good to provide some guides and examples of common configurations that work well with ModeShape and go beyond a basic local or replicated cache that uses a file store. In order to put together some examples I have some questions (although some of these might be more relevant for the Infinispan forums, it would be good to look at them from the perspective of ModeShape).

       

        a) Looking through the Infinispan documentation, it looks like there are a large number of cache stores available. Given this variety of choice, what are the most suitable choices for a ModeShape cluster (e.g. File, JDBC, LevelDB)?

        b) What are the recommendations around shared vs. non-shared cache stores in a ModeShape cluster?

        c) When configuring a Cache Store for use in a ModeShape cluster inside a Wildfly environment what is the best transaction level to use (NON_XA, FULL_XA, other)?

        d) What are good defaults for the following cache store values when used in a ModeShape cluster and why:

            - passive

            - mode (sync vs async)

            - JDBC cache loader type (if using a JDBC Cache Store)

            - eviction strategy

            - transaction locking mode

       

      3. What are the best practices around configuring binary cache stores (data and metadata) in a ModeShape cluster?

       

      Many thanks in advance for your help.

        • 1. Re: Best practices for clustering in ModeShape 4.0
          rhauch

          1. Based on previous discussions (see https://community.jboss.org/message/871616#871616) it looks like a good starting point for managing indexes in a cluster is for each server in the cluster to maintain its own local index (although there are some advantages to having a master index and using JMS to update it, or using an external index provider). Currently it looks like ModeShape 4 is going to initially offer support for local file system and local Lucene indexes. Given this:

           

            a) What are the pros/cons of using the file system vs. Lucene? Is there any documentation that discusses this?

            b) How do you configure indexes using the Wildfly subsystem?

            c) Will it be possible to migrate indexes from local to an external index provider?

          a) We don't know quite yet what the pros/cons will be for the file system vs. Lucene. It may be as simple as: if you want full-text search, use Lucene for those FTS indexes and the file system for everything else. I'm currently working on the initial file-system-based index provider, so it is still evolving.

           

          b) Index definitions and index providers are both defined in the XML configuration file for Wildfly as part of the ModeShape subsystem. The XSD (as it currently stands) is here. Basically, it will look something like this test configuration (which has the providers commented out because the index providers are not yet functional). You also can use the other management mechanisms, such as command line interface (CLI), to dynamically add/change/remove the index definitions and index providers.

           

          c) I don't expect to need to migrate indexes from one provider to another. When a new index is defined, a provider will be able to request that the workspace(s) be reindexed so that the new index can be populated. So "migrating" an index from one provider to another will likely be equivalent to dropping the index from the first provider and adding it in the second.

           

          2. Based on the 4.0 documentation and the standalone-modeshape-ha.xml shipped with the Wildfly subsystem, it looks like all persistent storage and clustering is now configured via Infinispan, and the ModeShape configuration just references the cache-container and cache to use. Given the complexity and learning curve associated with configuring Infinispan, it would be good to provide some guides and examples of common configurations that work well with ModeShape and go beyond a basic local or replicated cache that uses a file store. In order to put together some examples I have some questions (although some of these might be more relevant for the Infinispan forums, it would be good to look at them from the perspective of ModeShape).

          I think you've already seen the quickstart examples. We do want to improve them and provide several configurations that showcase very common aspects. Of course, if you have any suggestions for new examples, or would like to help contribute to them, please let us know. As with everything else, we take pull-requests.

           

            a) Looking through the Infinispan documentation, it looks like there are a large number of cache stores available. Given this variety of choice, what are the most suitable choices for a ModeShape cluster (e.g. File, JDBC, LevelDB)?

          ModeShape 4.0 uses Infinispan 6.0, which introduces several new cache stores, including a new file system cache store and a LevelDB-based cache store. None of the core ModeShape developers has had time to compare/contrast these, but from what I've heard on the Infinispan mailing list, the LevelDB store seems to be preferred and quite performant. Of course, when used in a cluster it may be preferred to have each process in the cluster configured with a non-shared cache store; all sharing would be done via Infinispan's distributed architecture. (If you look at other uses of LevelDB in clusters, they often will use some technology to synchronize the different instances running in the cluster. Infinispan already does this.) We've not tested these configurations to see how much faster they are, but I do hope to rely upon the Infinispan community for lots of help in the coming months. (And it'd be great to have the community already start testing and considering options while we focus on features.) The good news is that we have more options with Infinispan 6, and all the new cache stores attempt to address the slower performance of some of the older cache stores.

           

            b) What are the recommendations around shared vs. non-shared cache stores in a ModeShape cluster?

          It really depends. With Infinispan 5, we didn't have many options. But as I stated above I think we have more realistic options with Infinispan 6. IMO, it boils down to whether you want to have multiple stores that are kept in sync, or whether you want a shared cache store that already has some form of backup/fault-tolerance/high-availability.

            c) When configuring a Cache Store for use in a ModeShape cluster inside a Wildfly environment what is the best transaction level to use (NON_XA, FULL_XA, other)?

          NON_XA is usually recommended. I'm not sure anything else makes sense for ModeShape 3 or 4.

           

            d) What are good defaults for the following cache store values when used in a ModeShape cluster and why:

                - passive

                - mode (sync vs async)

                - JDBC cache loader type (if using a JDBC Cache Store)

                - eviction strategy

                - transaction locking mode

          With Infinispan, passivation is the mechanism by which some entries are pushed out of memory and into storage. When passivation is enabled, the entries in memory are not in the store, and the entries in storage are not in memory. Always turn passivation OFF for ModeShape. (There may be some exceptions, but they will be few and far between.)
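
          To make the difference concrete, here is a minimal toy model of the behavior described above. It is not Infinispan code: the class name, the two maps, and the "oldest entry goes first" rule are all assumptions made up for illustration. It only shows that with passivation on, an entry lives in memory or in the store but not both, while with passivation off the store holds every entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (hypothetical, not Infinispan) of a bounded in-memory map backed by a store. */
final class PassivationSketch {
    final Map<String, String> memory = new LinkedHashMap<>();
    final Map<String, String> store = new LinkedHashMap<>();
    private final boolean passivation;
    private final int maxInMemory;

    PassivationSketch(boolean passivation, int maxInMemory) {
        this.passivation = passivation;
        this.maxInMemory = maxInMemory;
    }

    void put(String key, String value) {
        memory.put(key, value);
        if (passivation) {
            // Passivation on: an entry held in memory is not kept in the store.
            store.remove(key);
        } else {
            // Passivation off: every write also goes to the store.
            store.put(key, value);
        }
        // Push the oldest entries out of memory once the limit is exceeded.
        while (memory.size() > maxInMemory) {
            String oldest = memory.keySet().iterator().next();
            String oldValue = memory.remove(oldest);
            if (passivation) {
                store.put(oldest, oldValue); // written to the store only on the way out
            }
        }
    }

    public static void main(String[] args) {
        PassivationSketch off = new PassivationSketch(false, 2);
        PassivationSketch on = new PassivationSketch(true, 2);
        for (String key : new String[] {"a", "b", "c"}) {
            off.put(key, "v");
            on.put(key, "v");
        }
        System.out.println("passivation off, store holds: " + off.store.keySet()); // [a, b, c]
        System.out.println("passivation on,  store holds: " + on.store.keySet());  // [a]
    }
}
```

          In this toy model, the store is a complete copy of the data only when passivation is off.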

           

          Infinispan either writes information to the cache store in a synchronous fashion (e.g., the entries are persisted in the cache store before the Infinispan method returns to ModeShape's code, aka "write through"), or in an asynchronous fashion (e.g., the entries are persisted in a separate, asynchronous thread, aka "write behind"). Which to use with ModeShape depends upon the guarantees that you require should components/processes start failing. A conservative option is to use SYNC.
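
          As a rough illustration of the two modes, the sketch below uses plain JDK classes only. The class and method names are hypothetical and none of this is Infinispan's API. In the synchronous case the store has the entry by the time put() returns; in the asynchronous case the write is queued on a background thread, so a failure before the queue drains could lose it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model (hypothetical, not Infinispan) of write-through vs. write-behind. */
final class WriteModeSketch {
    final Map<String, String> memory = new ConcurrentHashMap<>();
    final Map<String, String> store = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final boolean sync;

    WriteModeSketch(boolean sync) {
        this.sync = sync;
    }

    void put(String key, String value) {
        memory.put(key, value);
        if (sync) {
            // Write-through: the entry is in the store before put() returns.
            store.put(key, value);
        } else {
            // Write-behind: the store is updated later by a separate thread.
            writer.execute(() -> store.put(key, value));
        }
    }

    /** Stops accepting writes and waits for queued ones; returns true if all completed. */
    boolean shutdownAndFlush() throws InterruptedException {
        writer.shutdown();
        return writer.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        WriteModeSketch syncCache = new WriteModeSketch(true);
        syncCache.put("node-1", "data");
        System.out.println("sync, stored on return: " + syncCache.store.containsKey("node-1"));
        syncCache.shutdownAndFlush();

        WriteModeSketch asyncCache = new WriteModeSketch(false);
        asyncCache.put("node-1", "data");
        // Whether the store has the entry at this point depends on thread timing.
        asyncCache.shutdownAndFlush();
        System.out.println("async, stored after flush: " + asyncCache.store.containsKey("node-1"));
    }
}
```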

           

          If you're using a JDBC cache store with ModeShape, always use the JdbcStringBasedCacheStore. ModeShape always uses strings for its keys, so the other JDBC cache stores add no value whatsoever.

           

          Infinispan's eviction involves throwing entries out of the Infinispan cache (e.g., ModeShape's storage) that are older than some amount of time. Never use eviction for the caches in ModeShape - the only exception is the workspace cache configuration that is literally used as a cache and is configured entirely differently than the normal storage caches.

           

          Transaction locking mode should generally be PESSIMISTIC for ModeShape, unless you can guarantee that your application will not have multiple threads attempting to update the same nodes. (Sometimes this is feasible when all modifications are pipelined via a queue, but in general this is pretty rare.)
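
          The defaults suggested in this reply can be collected into a small checklist. The sketch below is only an illustration: the setting names ("transaction-mode", "passivation", "eviction", "locking") are hypothetical labels, not actual Infinispan or Wildfly attribute names, and the class is not part of ModeShape. The write mode is left out because SYNC is offered above only as a conservative option, and the JDBC store type is left out because it applies only when a JDBC cache store is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checklist of the defaults suggested in this thread. */
final class CacheSettingsCheck {
    // Values come from the reply above; the keys are made-up labels for illustration.
    static final Map<String, String> RECOMMENDED = new LinkedHashMap<>();
    static {
        RECOMMENDED.put("transaction-mode", "NON_XA");
        RECOMMENDED.put("passivation", "false");
        RECOMMENDED.put("eviction", "NONE");
        RECOMMENDED.put("locking", "PESSIMISTIC");
    }

    /** Returns one message per setting that is missing or differs from the suggestion. */
    static List<String> deviations(Map<String, String> actual) {
        List<String> messages = new ArrayList<>();
        for (Map.Entry<String, String> entry : RECOMMENDED.entrySet()) {
            String found = actual.get(entry.getKey());
            // equalsIgnoreCase(null) is false, so a missing setting is reported too.
            if (!entry.getValue().equalsIgnoreCase(found)) {
                messages.add(entry.getKey() + ": suggested " + entry.getValue() + ", found " + found);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>(RECOMMENDED);
        settings.put("locking", "OPTIMISTIC");
        for (String message : deviations(settings)) {
            System.out.println(message); // locking: suggested PESSIMISTIC, found OPTIMISTIC
        }
    }
}
```

          A deviation is not necessarily a mistake; as noted above, OPTIMISTIC locking can be reasonable when all modifications are pipelined through a queue.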

           

          3. What are the best practices around configuring binary cache stores (data and metadata) in a ModeShape cluster?

          IMO the primary reason to choose file system, database, Infinispan storage, or other for binary values is simply what makes sense for your application. For non-clustered topologies, the file system is entirely sufficient. However, this may not work well with NFS-shared directories (locks are notoriously brittle on NFS mounts). So for clustered topologies, you probably want some mechanism by which the binary storage is shared, so that's either one of the databases (JDBC, MongoDB, Cassandra, etc.) or Infinispan (distributed, replicated). Note that it probably doesn't make sense to use Infinispan if you're just going to configure it to store things in a JDBC cache store. And at this point, you should choose the kind of database (or data grid) based upon your familiarity with managing it and the persistence guarantees (e.g., ACID or eventually consistent, etc.).

           

          Hope this helps, or is at least a good start.

          • 2. Re: Best practices for clustering in ModeShape 4.0
            ma6rl

            Sorry for the slow response. This is a great start and has given me plenty to work with.

             

            I am currently working on some additional example configurations that could be included with the modeshape-clustering quickstart. These include:

             

            - A JDBC example

            - An example that works on AWS (EC2) by using the JGroups TCP stack along with S3_PING to work around the UDP/multicast limitations in AWS.

             

            With the JDBC example, would it be more useful to have it a) use the default in-memory ExampleDS datasource or b) use a MySQL datasource that would need to be set up prior to running the example?

            • 3. Re: Best practices for clustering in ModeShape 4.0
              rhauch

              With the JDBC example, would it be more useful to have it a) use the default in-memory ExampleDS datasource or b) use a MySQL datasource that would need to be set up prior to running the example?

              IMO, I like examples that run out of the box with minimal setup or requirements, so I'd favor using the ExampleDS data source. Of course, the README could certainly have instructions on how to change it to use MySQL, Postgres, etc. If things get too complicated, then we could always have multiple JDBC quickstarts.