Brian, thanks again for the discussion. BTW, I've only really started using the distinction between 'stored' and 'external' content in the last month. There was no mention of it in the earlier NG ModeShape post.
In my case, for a new installation, all content would be stored. We are not planning on having the content accessible via other means. That said, we also have the case of legacy data stored in the filesystem and some database tables (for which I wrote a 2.6 connector) and I suppose that this can be seen as external data.
I would say that if it's not stored in the main Infinispan cache(s) used by a repository, then I'd call it external. So for example, in 2.x the in-memory connector, the JPA connector, the disk connector, the Infinispan connector and the JBossCache connector are all examples of "storage" connectors, because whatever data they manage (store) is only meant to be accessed through ModeShape and not directly by other third-party applications. In other words, they "own" and are wholly in charge of the data.
In 2.x, the other connectors are examples of connectors that deal with what I'd now refer to as external content: the file system connector, JCR connector, JDBC metadata connector, and SVN connector. Data in all of these is easily accessible/changeable from other applications. If your database connector accesses content from a database's tables that are or can be used by other applications, I'd consider your connector to be in this bucket, too.
I'm not sure I understand where the new type of connector sits. Would Infinispan caching still be used for the external data? I assume that (at least in part) it is Infinispan that gives you the performance benefits you mentioned in the Next Gen article. How is the new type of connector different from a cache loader under Infinispan?
Well, it's still only an idea at this point. To get a better understanding of why cache loaders are difficult to use, let's talk about how ModeShape 3 uses Infinispan.
ModeShape 3 places into an Infinispan cache a single entry for each JCR node. That entry is keyed by a string key (which you can think of as containing in part a UUID, though this is not always the case), and the value is a JSON/BSON document object (using a class of our own creation). The document class has a marshaller that Infinispan uses when it needs to serialize or deserialize the data (note that Infinispan does not rely upon Java serialization), and our marshaller serializes and deserializes to and from the BSON binary format.
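A minimal sketch of that key-to-document model, using a plain Map in place of the Infinispan cache. The `Document` class and the key below are illustrative stand-ins, not ModeShape's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

public class NodeCacheSketch {
    // Stand-in for the JSON/BSON document class described above.
    static final class Document {
        private final Map<String, Object> fields = new HashMap<>();
        Document with(String name, Object value) { fields.put(name, value); return this; }
        Object get(String name) { return fields.get(name); }
    }

    // One entry per JCR node: string key -> document value. The real store is
    // an Infinispan cache; a Map just shows the shape of the data.
    static final Map<String, Document> cache = new HashMap<>();

    public static void main(String[] args) {
        String key = "node-a1b2c3"; // illustrative; real keys are UUID-based strings
        cache.put(key, new Document().with("jcr:primaryType", "nt:unstructured")
                                     .with("jcr:title", "Example"));
        System.out.println(cache.get(key).get("jcr:title")); // prints "Example"
    }
}
```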
Now, each Infinispan cache can be configured a couple of different ways.
- Keep everything in-memory - Each entry is distributed or replicated across the grid. IMO this offers a very intriguing performant yet fault-tolerant option, since multiple copies of each entry can be effectively "backed up" across multiple machines, data centers, and sites. In this case, Infinispan is a full-tilt data grid.
- Keep most things in-memory, but persist only what's not in-memory - Most (recently-used) entries would be kept in-memory, but other entries (likely those that haven't been used in a while) would be persisted using a cache loader. In this case, Infinispan is still more like a data grid than a traditional cache.
- Persist all entries and keep a subset in-memory - Each process on which a value is distributed/replicated uses the cache loader to persist that value outside of the process's memory, and only some of the entries that are needed are also cached in-memory. In this case, Infinispan is more like a traditional cache.
So if we were to use cache loaders for dealing with both "stored" and "external" data, we certainly could use different caches for each, and we'd have one cache for each "external" store (so they could use cache loaders with different configurations). But let's think about how that might work.
Consider what a cache loader "sees" when it accesses a particular node. The cache loader is asked to get the bytes for the node with a particular key, and it simply finds and returns the stream of bytes for that node. What about persisting? Well, the cache loader is asked to "put" the node with a particular key and a stream of bytes, and the cache loader merely streams those bytes into the appropriate slot in whatever it's using for storage.
Now consider what might happen if we used cache loaders to access an external system. The cache loader is asked to get the bytes for the node with a particular key - and that's all the insight the cache loader has. How is it supposed to know how to access the external system? Sure, it could interrogate the key and assume there's information encoded in the key, but this is quite complicated. How about when the cache loader is asked to "put" (i.e., persist) the node with a particular key and a stream of bytes? Now the cache loader has to re-materialize the JSON/BSON document from the streamed data before it can look at the properties stored in that node's document and do something with it. So while this step is feasible, it's not efficient.
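To make the contrast concrete, here's a hypothetical sketch of the "dumb" loader contract (this is not Infinispan's actual CacheLoader SPI, just the essence of it): the loader only ever sees an opaque key and opaque bytes, which is exactly why it has no basis for talking to an external system:

```java
import java.util.HashMap;
import java.util.Map;

public class DumbLoaderSketch {
    // All a "dumb" loader knows: a key in, bytes out (and vice versa).
    interface ByteStore {
        byte[] load(String key);              // find and return the serialized bytes
        void store(String key, byte[] bytes); // stream the bytes into a storage slot
    }

    // Trivial in-memory implementation. No interpretation of the bytes is needed,
    // which is what makes "storage" loaders easy and "external" loaders hard.
    static final class MapStore implements ByteStore {
        private final Map<String, byte[]> slots = new HashMap<>();
        public byte[] load(String key) { return slots.get(key); }
        public void store(String key, byte[] bytes) { slots.put(key, bytes); }
    }
}
```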
So implementing a "dumb" cache loader is very straightforward, but it's a lot harder to implement a "smarter" one, simply because the cache loader has to do more work or because not all the information might be available. "Storage" cache loaders are trivial, and in fact we can reuse all of Infinispan's existing cache loader implementations. Cache loaders that access "external" systems are (very) hard.
My current thought is that we can continue to use cache loaders as they were intended, but that we can (re-)introduce the notion of a connector to external data above the Infinispan layer. And that connector can have access to the node object itself, meaning it can leverage the node's properties (and even children, if that's useful) to figure out how to interact with the external system. We can even cache the result in a different Infinispan cache for a short (but configurable) period of time, so we don't have to keep going back to the external system every time it's accessed.
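Here's a hypothetical sketch of what such a connector contract might look like (all the names here are my assumptions, not an actual ModeShape API): unlike a cache loader, the connector is handed the materialized node, so it can inspect the node's properties before deciding how to reach the external system:

```java
import java.util.HashMap;
import java.util.Map;

public class ConnectorSketch {
    // Stand-in for the node's already-deserialized document: the connector sees
    // properties (and could see children), not just an opaque byte stream.
    interface NodeDocument {
        Map<String, Object> properties();
    }

    interface ExternalConnector {
        NodeDocument read(String key);              // fetch from the external system
        void write(String key, NodeDocument node);  // push changes back, guided by the properties
    }

    // Toy implementation backed by a Map, standing in for the external system.
    static final class InMemoryConnector implements ExternalConnector {
        private final Map<String, NodeDocument> external = new HashMap<>();
        public NodeDocument read(String key) { return external.get(key); }
        public void write(String key, NodeDocument node) { external.put(key, node); }
    }
}
```

In a real implementation the results of `read` could be cached in a separate Infinispan cache with a configurable, short time-to-live, as described above.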
I think that the configuration of the connector needs to be independent of the metadata. The metadata needs to fully define the selection of a connector, but not the configuration of that connector. The reason is that there could be multiple independent subtrees stored in a single connector, not just a single subtree and its descendants. Similarly, a tree of nodes could be stored across multiple connectors (up to one per node!) as you traverse down to the leaves.
I would think that if a node is marked as federated, then it, its properties, and its children would all be federated unless any of the children are marked as being federated somewhere else.
Good feedback. I think we might be on the same page here.
I hadn't really thought about binary content and my only requirement would be that however it is done it needs to be transactional.
This is probably another topic altogether (and this post is already way too long), but ...
Our new binary storage framework is actually not transactional per se. Instead, it immediately stores binary content as soon as you create a Binary value; a failure to store the binary content will result in an exception. Note that this all happens before calling Session.save(). We don't need transactions for two reasons:
- binary content is keyed by the SHA-1 of the content, so it's independent of where it's being used.
- even if the Session changes are aborted or never saved, the binary content will simply go unused and will (eventually) get garbage collected by our framework.
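As a sketch of why SHA-1 keying sidesteps the need for transactions: identical content always hashes to the same key, independent of which node references it (illustrative code, not ModeShape's actual binary store):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BinaryKeySketch {
    // Compute the content-addressed key: the SHA-1 of the binary's bytes, as hex.
    public static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is available on every JVM
        }
    }
}
```

Storing the same bytes twice yields the same key, so a duplicate or abandoned store is harmless: the entry either already exists or ends up unreferenced and eligible for garbage collection.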