Will the index files that Lucene creates be distributed across the cluster? Does Lucene work with multiple lucene-core engines updating the indexes from multiple hosts in a cluster?
If I'm using Infinispan under the JCR (as 3.x does), then my data is distributed, but what about the indexes? Are they also stored in Infinispan and similarly distributed?
I am currently thinking about how the JCR is going to work in a clustered environment. The current version, 2.8, stores the indexes in the filesystem at the path defined by the value of the configuration element queryIndexDirectory. Will 3.x be similar?
AFAIK, Jackrabbit in a clustered environment keeps a full index on each cluster node, and this index is maintained independently on each node (I'm not really sure about this and could be wrong).
As is obvious from the above, I have no idea how this does or will work.
In short, there are several options for how Lucene indexes can be stored.
Using Lucene within a cluster has some challenges (e.g., only one writer can update the indexes at a time). So rather than using Lucene directly like we did in 2.x, in 3.x we're using Hibernate Search as a framework for updating and managing the Lucene indexes. (Important: we're only using the bottom half of Hibernate Search, which they call the "engine" and which does not depend on Hibernate ORM or JPA.) The Hibernate Search engine is essentially a clustering utility layer for Lucene, and it gives us the flexibility to configure different ways for a cluster to update the indexes efficiently.
Each ModeShape 3.0 JCR repository instance will use a Hibernate Search engine to update the indexes, even when running in a cluster. So there will be two primary decisions for how to manage the indexes:
- Where should the indexes be stored?
- If clustering, can each process in the cluster update the indexes directly?
The options for storing the Lucene indexes are:
- Filesystem - the indexes are stored on the file system
- Infinispan - the indexes are stored (and optionally distributed) in dedicated caches within the Infinispan grid (see here for more info)
- RAM - the indexes are stored in-memory (obviously usefulness is limited)
- Custom Lucene directory implementations
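As a rough illustration of the first three storage options above, here is a minimal sketch using the property names that Hibernate Search 4.x itself documents for choosing a directory provider. This is an assumption about the underlying layer only: how ModeShape 3.x exposes these settings in its own repository configuration is not covered here, and the class name, method names, and example path are hypothetical.

```java
import java.util.Properties;

/** Sketch only: builds Hibernate Search style settings for each index storage option. */
public class IndexStorageSketch {

    // Property names as documented by Hibernate Search 4.x (not ModeShape-specific).
    static final String PROVIDER = "hibernate.search.default.directory_provider";
    static final String INDEX_BASE = "hibernate.search.default.indexBase";

    /** Indexes stored in a directory on the file system. */
    public static Properties filesystem(String indexDir) {
        Properties props = new Properties();
        props.setProperty(PROVIDER, "filesystem");
        props.setProperty(INDEX_BASE, indexDir);
        return props;
    }

    /** Indexes stored in Infinispan caches (needs the Hibernate Search Infinispan module). */
    public static Properties infinispan() {
        Properties props = new Properties();
        props.setProperty(PROVIDER, "infinispan");
        return props;
    }

    /** Indexes kept in memory only; they are lost when the process stops. */
    public static Properties ram() {
        Properties props = new Properties();
        props.setProperty(PROVIDER, "ram");
        return props;
    }

    public static void main(String[] args) {
        // The path below is a made-up example.
        System.out.println(filesystem("/var/modeshape/indexes"));
        System.out.println(infinispan());
        System.out.println(ram());
    }
}
```

Only the file system option needs a location; the other two are selected purely by the provider name.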
Non-clustered repositories are pretty easy: there's only one set of indexes, so the repository can directly update them. This may even work in some clustered situations: if storing in Infinispan, or if storing on the filesystem and each process in the cluster has efficient access to the file system. But in larger/more complicated cluster topologies, it may be more efficient (or even desirable) to have only one of the processes write to the indexes, and to have all other processes forward (through JMS or JGroups; see here for more details) their writes to the one master process. And when storing the indexes on the file system, each cluster process will likely want its own copy of the indexes for reading, so the Hibernate Search engine provides a variation of the filesystem storage option where there's a single master set of indexes stored on a filesystem and the other processes have read-only copies (updated in various ways).
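To make the "one master writes, everyone else forwards" arrangement concrete, here is a toy, in-process model of it. It is not how Hibernate Search or ModeShape implement forwarding: a single-threaded executor stands in for the JMS or JGroups channel, a plain list stands in for the Lucene index, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model: many nodes produce index updates, but only one thread applies them. */
public class SingleWriterSketch {

    /** Stands in for the master process, the only code that touches the index. */
    static final class Master {
        private final List<String> index = new ArrayList<>(); // stand-in for the Lucene index
        private final ExecutorService writer = Executors.newSingleThreadExecutor();

        /** Called by any node; the update is queued and applied by the single writer thread. */
        void forward(String nodeName, String update) {
            writer.execute(() -> index.add(nodeName + ":" + update));
        }

        /** Stops accepting updates, waits for the queued ones, and returns what was applied. */
        List<String> shutdownAndGet() {
            writer.shutdown();
            try {
                if (!writer.awaitTermination(5, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("writer did not finish in time");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return new ArrayList<>(index);
        }
    }

    /** Two "cluster nodes" forward three updates each to the master. */
    public static List<String> run() {
        Master master = new Master();
        Thread nodeA = new Thread(() -> {
            for (int i = 0; i < 3; i++) master.forward("nodeA", "doc" + i);
        });
        Thread nodeB = new Thread(() -> {
            for (int i = 0; i < 3; i++) master.forward("nodeB", "doc" + i);
        });
        nodeA.start();
        nodeB.start();
        try {
            nodeA.join();
            nodeB.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return master.shutdownAndGet();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Updates from different nodes may interleave in any order, but each node's own updates are applied in the order it sent them, and the index never sees two concurrent writers.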
In summary, there's a lot of flexibility here, but hopefully we can make it very easy to set up ModeShape for most non-clustered and clustered situations, while still giving the few cases that need it the ability to access and control that flexibility. We think that storing the indexes in Infinispan will be the easiest and best-performing option for most clusters (and maybe even for non-clustered repositories, too).
All of this can be configured right now (even with Alpha2), but we don't yet have any documentation describing how to do it. If you're interested in trying this, let us know and we can help you with the configuration.