1 Reply Latest reply on Feb 27, 2012 10:47 AM by rhauch

Modeshape 3.0.0.x - Performance and storage schema

martin-senne Feb 27, 2012 5:07 AM

Hey Randell,

first, initial performance tests with 3.0.0.Alpha1 have confirmed a great speedup (a factor of ~8 in our case) compared to Modeshape 2.7.

That's great, so please keep up the good work!!!!

Nevertheless, due to the early state of the 3.0 version, queries (or more precisely: query execution) are not implemented right now. But to be able to complete our performance tests, it is essential to perform queries (mainly in JCR-SQL2).

So, the obvious question: When (and how mature) will this be in 3.0.0.Alpha2?

Second, could you please provide a brief overview or refer to some documentation/http pages, of how JCR puts nodes into Infinispan.

- How are nodes/properties mapped to key/values?

- Could you provide a starting point in the code, where to start "tracing data flow"?

- What is HibernateSearch used for?

- Is there a way to "manipulate" the way keys are stored in Infinispan, or is the Infinispan layer kind of a "black box" from Modeshape's point of view, meaning Modeshape sees just a key/value store without knowledge about internal storage?

Cheers and thanks,

Martin

1. Re: Modeshape 3.0.0.x - Performance and storage schema

rhauch Feb 27, 2012 10:47 AM (in response to martin-senne)

First of all, thank you, Martin, for giving the 3.0 alpha a whirl. I know the documentation is sparse and features are incomplete, but we'll want all the help we can get in getting things stable.
> first, initial performance tests with 3.0.0.Alpha1 have confirmed a great speedup (a factor of ~8 in our case) compared to Modeshape 2.7.
> That's great, so please keep up the good work!!!!
Glad to see you're getting results like we are! Were you using a stock Infinispan cache, or did you set up an Infinispan configuration with a cache store (aka, cache loader)?
> Nevertheless, due to the early state of the 3.0 version, queries (or more precisely: query execution) are not implemented right now. But to be able to complete our performance tests, it is essential to perform queries (mainly in JCR-SQL2).
> So, the obvious question: When (and how mature) will this be in 3.0.0.Alpha2?
I'm trying to finish up the query functionality, and as soon as that's done I'd like to push out Alpha2. As I said on another thread, I'm hoping it'll be the end of this week or early next. Unfortunately, I've been getting pulled onto other projects, but I hope that's done.

The indexing system is completely rewritten in 3.0 and leverages Hibernate Search (which is a clustering utility layer on top of Lucene, and the part of HS that we're using does not depend on Hibernate Core, Hibernate ORM, or any of the other Hibernate/JPA libraries). This will dramatically simplify our indexing-related codebase, improve performance in a cluster (and hopefully also standalone), and give us a lot more flexibility in how we configure clusters. We're still using Lucene for our indexes, and the basic design of the indexes is largely the same.

The query engine (e.g., parser, planner, optimizer, processor) is largely the same, so we're thinking performance will be pretty similar to 2.x. The query results, however, will hopefully be more efficient and faster due to the improvements you've already seen with Infinispan.
> Second, could you please provide a brief overview or refer to some documentation/http pages, of how JCR puts nodes into Infinispan.
> - How are nodes/properties mapped to key/values?
> - Could you provide a starting point in the code, where to start "tracing data flow"?

I have yet to write up the design, but I'll give an overview here.

Each node is mapped to a single Document, and then inserted into the Infinispan cache keyed by a String identifier (see NodeKey for the structure of the string). Using Document is great, because it's an in-memory representation of the JSON document (with additional BSON value types). Infinispan simply serializes the values when it needs to persist or distribute the Document value, so our Document object serializes to/from the BSON binary format. This means that any persisted value is a standard format, not a serialized Java object.
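
To make the node-to-document mapping concrete, here is a minimal sketch using only the JDK. It is not ModeShape code: a ConcurrentHashMap stands in for the Infinispan cache, nested maps stand in for the Document type, and the field name "properties" and the example key string are assumptions made up for illustration (the real key layout is defined by NodeKey).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in: a plain map plays the Infinispan cache,
// nested maps play the JSON-like Document. Not thread-safe below the top level.
class NodeDocumentSketch {

    // One cache entry per node: String key -> document.
    static final Map<String, Map<String, Object>> CACHE =
            new ConcurrentHashMap<String, Map<String, Object>>();

    // An empty node document; "properties" is an illustrative field name.
    private static Map<String, Object> newDocument() {
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        doc.put("properties", new LinkedHashMap<String, Object>());
        return doc;
    }

    // Stores a property in a nested document keyed by its namespace URI.
    @SuppressWarnings("unchecked")
    static void putProperty(String nodeKey, String namespaceUri, String localName, Object value) {
        Map<String, Object> doc = CACHE.computeIfAbsent(nodeKey, k -> newDocument());
        Map<String, Object> properties = (Map<String, Object>) doc.get("properties");
        Map<String, Object> inNamespace = (Map<String, Object>) properties.computeIfAbsent(
                namespaceUri, ns -> new LinkedHashMap<String, Object>());
        inNamespace.put(localName, value);
    }

    // Reads a property back, or returns null if the node or property is absent.
    @SuppressWarnings("unchecked")
    static Object getProperty(String nodeKey, String namespaceUri, String localName) {
        Map<String, Object> doc = CACHE.get(nodeKey);
        if (doc == null) return null;
        Map<String, Object> properties = (Map<String, Object>) doc.get("properties");
        Map<String, Object> inNamespace = (Map<String, Object>) properties.get(namespaceUri);
        return inNamespace == null ? null : inNamespace.get(localName);
    }

    public static void main(String[] args) {
        String jcrNs = "http://www.jcp.org/jcr/1.0";
        putProperty("example-node-key", jcrNs, "primaryType", "nt:folder");
        System.out.println(CACHE.get("example-node-key"));
    }
}
```

The point of the sketch is only the shape of the data: one cache entry per node, with the node's properties grouped by namespace inside the value.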

Within the Document we're storing the properties (in nested documents keyed by the namespaces), references to the child nodes, and a bit of additional metadata. And the child references start out as a JSON array of small nested documents (one per child), but as the number of child nodes grows we can actually break the child references across multiple other Documents. We call that segmenting, and it allows the child nodes to grow to very large numbers without having the node's Document get too large. We haven't figured out the optimal number of child nodes per segment, but it's tunable and done in a background process to prevent adding extra work to the JCR access calls.
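
The segmenting idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual algorithm: the "#seg<N>" key suffix, the fixed block size, and the method name are all hypothetical, and the real implementation runs in the background and links the segments inside the Documents themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of splitting one node's child references across documents.
class ChildSegmentingSketch {

    // Returns document key -> child keys held in that document.
    // The first block stays in the parent's own document; later blocks go to
    // made-up keys of the form "<parentKey>#seg<N>".
    static Map<String, List<String>> segment(String parentKey, List<String> children,
                                             int maxPerSegment) {
        if (maxPerSegment < 1) {
            throw new IllegalArgumentException("maxPerSegment must be at least 1");
        }
        Map<String, List<String>> docs = new LinkedHashMap<String, List<String>>();
        // The parent always has a document, even with no children.
        docs.put(parentKey, new ArrayList<String>());
        int seg = 0;
        for (int start = 0; start < children.size(); start += maxPerSegment, seg++) {
            String key = (seg == 0) ? parentKey : parentKey + "#seg" + seg;
            int end = Math.min(start + maxPerSegment, children.size());
            docs.put(key, new ArrayList<String>(children.subList(start, end)));
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> kids = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(segment("parent", kids, 2));
    }
}
```

With a small block size the parent's own document stays bounded no matter how many children it has, which is the property described above.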

All of the JCR implementation classes eventually funnel down to a single "repository cache" layer that encapsulates all of the Infinispan and JSON mapping. There's a single RepositoryCache instance in each JCR Repository instance, while the CachedNode is an immutable representation of a JCR node (without all the JCR logic). The NodeCache is a cache of the CachedNode instances for a given workspace, and is owned by the RepositoryCache and shared by all sessions that are reading content. Each JCR Session is represented by a single SessionCache object that contains all the Session's transient modified state, but it stores all of this as a delta on top of the persisted CachedNode objects. When a session's changes are saved (in Infinispan within a JTA transaction), the CachedNode objects for those nodes are updated in the workspace's cache, so everybody sees the updated content immediately following the commit. (Most of the implementation classes are in the "org.modeshape.jcr.cache.document" package, including the DocumentTranslator class that contains all of the logic for mapping between a Document value and a CachedNode.)
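
The "session state as a delta over shared immutable nodes" arrangement might be pictured like this. It is a simplified sketch with hypothetical class and method names, not the RepositoryCache/SessionCache API; in particular it leaves out the JTA transaction and reduces each node to a flat map of string properties.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a shared workspace view plus per-session transient changes.
class SessionDeltaSketch {

    // Shared persisted view: node key -> immutable property map.
    static final Map<String, Map<String, String>> WORKSPACE =
            new ConcurrentHashMap<String, Map<String, String>>();

    static class Session {
        // Transient changes made in this session only.
        private final Map<String, Map<String, String>> changes =
                new HashMap<String, Map<String, String>>();

        // Reads see this session's delta first, then the shared persisted state.
        String get(String nodeKey, String property) {
            Map<String, String> delta = changes.get(nodeKey);
            if (delta != null && delta.containsKey(property)) {
                return delta.get(property);
            }
            Map<String, String> persisted = WORKSPACE.get(nodeKey);
            return persisted == null ? null : persisted.get(property);
        }

        // Writes touch only the delta until save() is called.
        void set(String nodeKey, String property, String value) {
            changes.computeIfAbsent(nodeKey, k -> new HashMap<String, String>())
                   .put(property, value);
        }

        // Publishes the delta by replacing each changed node with a new immutable
        // snapshot. A real implementation would do this inside a transaction.
        void save() {
            for (Map.Entry<String, Map<String, String>> e : changes.entrySet()) {
                Map<String, String> merged = new HashMap<String, String>();
                Map<String, String> old = WORKSPACE.get(e.getKey());
                if (old != null) merged.putAll(old);
                merged.putAll(e.getValue());
                WORKSPACE.put(e.getKey(), Collections.unmodifiableMap(merged));
            }
            changes.clear();
        }
    }

    public static void main(String[] args) {
        Session writer = new Session();
        Session reader = new Session();
        writer.set("node-1", "title", "draft");
        System.out.println("before save: " + reader.get("node-1", "title"));
        writer.save();
        System.out.println("after save: " + reader.get("node-1", "title"));
    }
}
```

Other sessions never see the writer's transient state; they only see the new snapshot once save() has replaced the shared entry.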

This whole framework is thread-safe, so it will be possible for multiple threads to use the same JCR Session object for reading content. (Strictly speaking, the classes will be thread-safe for multiple threads making changes, but conceptually that will be strange as each will be modifying the same transient state of a single JCR Session.) And even though we're using JTA transactions when we apply the changes during a Session.save(), we're trying to write this entire layer to be as "eventually-consistent" as we can. For example, when a node is added, "inserted before" another node, and then saved, we persist the changes by literally placing the node immediately before the specified node, wherever that node happens to be when we're persisting the change. And, we're trying to be tolerant of cases when other Sessions have persisted changes (e.g., removing or moving nodes) between when our Session's changes were made by the JCR client and when the Session.save() method is persisting the changes. So while we're using JTA transactions with full ACID changes, we hope to allow users to optionally choose to use eventually-consistent behavior.
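
The "insert before" example can be sketched as an operation that is resolved against the child list as it exists at save time rather than at the time of the client call. The method name is hypothetical, and the fallback of appending at the end when the target node has since been removed is purely an assumption for the sketch; the post does not say what ModeShape does in that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of applying a recorded "insert before" at save time.
class InsertBeforeSketch {

    // Places newChild immediately before 'before' in the CURRENT child order.
    // Assumption: if 'before' no longer exists, newChild is appended at the end.
    static List<String> applyInsertBefore(List<String> currentChildren,
                                          String newChild, String before) {
        List<String> result = new ArrayList<String>(currentChildren);
        int index = result.indexOf(before);
        if (index < 0) {
            result.add(newChild);        // target gone: tolerate it and append
        } else {
            result.add(index, newChild); // target found wherever it is now
        }
        return result;
    }

    public static void main(String[] args) {
        // The session saw [a, b, c] and recorded "insert x before b".
        // Meanwhile another session moved b to the end: [a, c, b].
        List<String> atSaveTime = Arrays.asList("a", "c", "b");
        System.out.println(applyInsertBefore(atSaveTime, "x", "b"));
    }
}
```

Because the operation is expressed relative to the target node and not as an absolute index, it still makes sense after concurrent moves by other sessions.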

That's probably not as clear as I'd like it to be, but I hope that helps a little. If you have more questions, please ask. We're holding off on documenting while we finish the features, but we're happy to answer any and all questions!
> - Is there a way to "manipulate" the way keys are stored in Infinispan, or is the Infinispan layer kind of a "black box" from Modeshape's point of view, meaning Modeshape sees just a key/value store without knowledge about internal storage?

We aren't really focused on creating a complete abstraction layer at this time, and we've created the RepositoryCache layer primarily for encapsulation purposes. However, I'm pretty confident that this will allow us to (eventually) turn this into an abstraction layer, if we think that's a good idea. Frankly, I'm so happy with Infinispan that I'm not sure why we'd want or need to use other implementations. Remember, we'll have a connector API that will allow us to talk to external systems, while still relying upon Infinispan for caching that federated information.

> - What is HibernateSearch used for?

We're only using the part of Hibernate Search that is a clustered utility layer on top of Lucene, and we're NOT using any parts that are related to or dependent upon the rest of Hibernate (e.g., Core or ORM) or JPA. It basically is a really nice framework for updating and using a cluster of Lucene indexes, and that's how we're using it.