21 Replies. Latest reply on Jun 27, 2012 12:00 AM by rhauch
      • 15. Re: Binary values in ModeShape 3
        bwallis42

        Randall Hauch wrote:

         

        The point above about only having a single binary store, can you expand on that? I absolutely need multiple binary stores so I can partition the storage of a large amount of data across multiple storage locations that may have different performance characteristics.

        Currently, a ModeShape repository uses a single BinaryStore instance. Now, that BinaryStore implementation can store the binary content however it wants, including storing it on multiple machines based upon whatever criteria. We currently have several BinaryStore implementations (see our initial documentation), and support using custom BinaryStores.

         

        What we've been talking about in the past few posts is for a single repository to be able to use a chain of multiple BinaryStore instances. Each BinaryStore can still do whatever it wants, but the ability to have separate BinaryStore instances would likely mean that different instances can be configured to do different things.

         

        Perhaps the one idea that might have the biggest impact, however, is changing the BinaryKey from effectively only SHA-1s to arbitrary formats, and to expose the keys to JCR clients. This allows a ModeShape installation to give the clients some control over where the binary content is stored.

         

        The binary store, whether one or multiple, would need some info on which to base its decision about where to store a particular value and I was thinking that this information would come from attribute values in the node containing the binary value or ancestor nodes of that node. If I understand the binary value interface (at the above documentation link) it doesn't have access to the containing node at the point that the decision needs to be made.

         

        I presume you're talking about this thread. Now that I've re-read your requirements, it does sound like you'd benefit from multiple (or even a custom) BinaryStore to persist the larger data the way you want. Would the improved BinaryStore capabilities be the complete solution for your federation use case, or do you still need to control where the regular (non-Binary value) content is stored, too?

        For my purposes I am really making the assumption that Binary == large and that the rest of the node tree would not occupy a large amount of storage (this is how our current system works: we only partition the file-based document store). But there are other reasons besides storage size to partition the data (speed of access comes to mind) and I think that the ability to partition at the node/subtree level is important as well.

        • 16. Re: Binary values in ModeShape 3
          rhauch

          The binary store, whether one or multiple, would need some info on which to base its decision about where to store a particular value and I was thinking that this information would come from attribute values in the node containing the binary value or ancestor nodes of that node. If I understand the binary value interface (at the above documentation link) it doesn't have access to the containing node at the point that the decision needs to be made.

          The purpose of a BinaryStore is to manage Binary values based solely upon the key and content, and completely independently of where those values are stored or other content. We've chosen to create this separate concept because it allows ModeShape to more efficiently persist and stream binary content (even when in transient Session state).
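A minimal sketch of the "key and content only" idea, assuming the key is the hex-encoded SHA-1 of the bytes as discussed in this thread. `ContentAddressedStore` and its methods are hypothetical names for illustration; this is not ModeShape's BinaryStore interface, and a real store would stream content rather than hold byte arrays in memory.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not ModeShape's BinaryStore interface.
final class ContentAddressedStore {
    // Key (SHA-1 hex) -> content. Nothing about nodes or paths is recorded.
    private final Map<String, byte[]> blobs = new HashMap<>();

    // Stores the content and returns its key, derived solely from the bytes.
    String store(byte[] content) {
        String key = sha1Hex(content);
        blobs.putIfAbsent(key, content.clone()); // identical content is kept once
        return key;
    }

    // Returns a copy of the content for the key, or null if the key is unknown.
    byte[] retrieve(String key) {
        byte[] content = blobs.get(key);
        return content == null ? null : content.clone();
    }

    // Number of distinct contents currently held.
    int size() {
        return blobs.size();
    }

    // Hex-encodes the SHA-1 digest of the given bytes (40 characters).
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Because the key depends only on the bytes, storing the same content from two different nodes yields the same key and a single stored copy, which is also why such a store cannot make placement decisions based on the containing node.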

           

          BinaryStores are very different from connectors, which serve (or will serve) as the bridge for passing content to and from external systems. We do think that some connectors will provide their own BinaryStore (since external systems will be responsible for storage of their Binary values), but transient Binary values will likely still be stored (at least temporarily) in the repository's SHA-1-based BinaryStore.

          For my purposes I am really making the assumption that Binary == large and that the rest of the node tree would not occupy a large amount of storage (this is how our current system works: we only partition the file-based document store).

          Agreed.

          But there are other reasons besides storage size to partition the data (speed of access comes to mind) and I think that the ability to partition at the node/subtree level is important as well.

          Agreed. And because you're basing your partitioning decisions on node content, I think connectors will continue to be more applicable than BinaryStores.

          • 17. Re: Binary values in ModeShape 3
            rhauch

            I am wondering if the key should explicitly specify the BinaryStore instance  along with the binary object ID within that instance to avoid ambiguity. ... That way, if "id1" is a valid id in more than one binary store, there is no ambiguity.

            Agreed. I think the string that makes up the BinaryKey would need to somehow uniquely (directly or indirectly) identify the store. Your suggestion would be a direct mapping, and this would leak the internal configuration of the BinaryStores into the client logic. Another option is indirect, which might use something similar to schemes in URIs; for example, a key such as "ftp:foobarbaz" might be handled by a particular BinaryStore, whereas "sha1:abcdef..." might be handled by another BinaryStore.
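The indirect, scheme-style mapping could be sketched roughly as follows. This assumes keys of the form "scheme:rest", as in the "ftp:foobarbaz" example; `KeyedStore`, `SchemeRouter`, `resolve` and `localId` are made-up names for illustration and are not part of any ModeShape API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a binary store; not a ModeShape type.
interface KeyedStore {
    String name();
}

// Hypothetical router that picks a store from the scheme prefix of a key.
final class SchemeRouter {
    private final Map<String, KeyedStore> storesByScheme = new HashMap<>();

    // Associates a scheme such as "ftp" or "sha1" with a store.
    void register(String scheme, KeyedStore store) {
        storesByScheme.put(scheme, store);
    }

    // Resolves "scheme:rest" to the store registered for that scheme.
    KeyedStore resolve(String key) {
        int colon = key.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Key has no scheme: " + key);
        }
        String scheme = key.substring(0, colon);
        KeyedStore store = storesByScheme.get(scheme);
        if (store == null) {
            throw new IllegalArgumentException("No store for scheme: " + scheme);
        }
        return store;
    }

    // Returns the store-local identifier: the part after the first colon.
    static String localId(String key) {
        return key.substring(key.indexOf(':') + 1);
    }
}
```

With this shape the client only sees scheme names, not store instances, so the store configuration can change without changing the keys. It does not address the concern about a client guessing or reusing keys.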

             

            However, this seems to be getting fairly complicated, and it's leaking a fair amount of responsibility and semantics to the client. I'm not only concerned about clients doing something wrong (and corrupting their persisted content), but also the security ramifications. Consider a client that can look up existing Binary content if it has the key (e.g., SHA-1 or some reference). The BinaryStore does not have the context in which the Binary value is created, so if the client has the key they can immediately create a Binary value that will return the content without any access control or authorization checks.

             

            Earlier, Jonathan described what he's looking for:

            In my use cases, my binary data (video files) needs to be accessible both from within JCR, and also as files to other applications. While this is a form of federation, it seems like it is specific to Binary storage, not to general federation. Indeed, I want to store everything but my video data in the 3.0 Infinispan-based nodes.

             

            This use case makes perfect sense, and I understand the concept of having some form of federation inside the BinaryStore. But I'm struggling with how to achieve that without placing an undue burden (or reliance) on the client properly specifying and using the binary keys, and without overly complicating the binary storage mechanism for all other users. I'm also concerned about what happens when the content at a particular key is changed.

             

            Could your use case be accomplished by simply storing the videos outside of ModeShape and storing the links (e.g., URL or some other key for the external system) inside ModeShape as simple properties? Your application would manage the uploading and streaming of the video files, while the links to the videos can be managed with the rest of the content. Since ModeShape can't really deal with the raw video binary content anyway (e.g., sequence, full-text search, etc.), is there a benefit to having ModeShape store the content?

             

            Or, perhaps you could store the videos inside ModeShape, and create the URL to the actual content (that is sent to the client) using the path to the node(s) that contain the Binary property value. This might work with the FileSystemBinaryStore, or a custom BinaryStore implementation (still SHA-1-based) could always be used.

             

            Thoughts?

            • 18. Re: Binary values in ModeShape 3
              jonathandfields

              In my use cases, my binary data (video files) needs to be accessible both from within JCR, and also as files to other applications. While this is a form of federation, it seems like it is specific to Binary storage, not to general federation. Indeed, I want to store everything but my video data in the 3.0 Infinispan-based nodes.

               

              This use case makes perfect sense, and I understand the concept of having some form of federation inside the BinaryStore. But I'm struggling with how to achieve that without placing an undue burden (or reliance) on the client properly specifying and using the binary keys, and without overly complicating the binary storage mechanism for all other users. I'm also concerned about what happens when the content at a particular key is changed.

               

              Could your use case be accomplished by simply storing the videos outside of ModeShape and storing the links (e.g., URL or some other key for the external system) inside ModeShape as simple properties? Your application would manage the uploading and streaming of the video files, while the links to the videos can be managed with the rest of the content. Since ModeShape can't really deal with the raw video binary content anyway (e.g., sequence, full-text search, etc.), is there a benefit to having ModeShape store the content?

               

               

              That, in fact, is what I am currently doing in a prototype system, using observation to keep JCR and the files in synch. For example, if the node containing a property referencing a file is deleted, the observation listener deletes the corresponding file. I am satisfied with moving forward with this approach if what we have been discussing has become too complicated. I certainly don't want this to be a distraction from getting the core ModeShape 3.0 out there, fast and rock solid. I am very comfortable with moving forward without this extension.

               

              The reason that I brought up the topic was that it appeared as if the current approach was fairly close to what I needed, if only the underlying key could be specified/retrieved and the assumption that it was a SHA-1 was removed. The main benefits that I saw were: a) being able to "tie" an existing file to a Binary, and have it automatically deleted if the property is deleted; and b) opening up the files to sequencing to extract video metadata and add that to JCR. (I have existing code that does that, which I would package as a sequencer.) This would use more of ModeShape and less application-specific code. However, those benefits are probably not significant enough to warrant the cost.

              • 19. Re: Binary values in ModeShape 3
                hchiorean

                This use case makes perfect sense, and I understand the concept of having some form of federation inside the BinaryStore. But I'm struggling with how to achieve that without placing an undue burden (or reliance) on the client properly specifying and using the binary keys, and without overly complicating the binary storage mechanism for all other users. I'm also concerned about what happens when the content at a particular key is changed.

                 

                I think this is an important aspect which we should take into account: if we expose (through the public JCR interfaces) the entire Binary Store/Key abstraction, not only would it make things more complex (on top of an already semi-complex API), but things could get messy pretty easily if clients misuse these abstractions (the protocol/binary store mapping in the key is a good example).

                 

                On the other hand, having such an abstraction would definitely give "advanced clients" a lot more power. IMO, we should keep this issue open and see if any other cases in which this would be needed come up.

                • 20. Re: Binary values in ModeShape 3
                  jonathandfields

                  I don't know if this is a common use case in general, but it is for people who are managing large amounts of binary data. I have been working on audio-visual content management systems for nearly 15 years, and we have always ended up "federating" the audio-visual data in files with some kind of metadata system, typically a relational database. We just cannot "lock up" the data in a closed silo like LOBs in a database or Binaries in JCR, because other systems, editors, and servers only work with files or URLs. Nor do we have the physical storage space or time to copy nearly a PB of data from one store into another. So some way of tightly integrating existing data in situ with the metadata has always been a desire.

                   

                  Prior to using ModeShape, I had prototyped a system using Jackrabbit that effectively just replaces the relational DB metadata store with Jackrabbit. JCR for metadata has a lot of benefits over a relational DB. But there were still two separate data stores that needed to be kept in synch. When I discovered ModeShape, which "isn't yet another silo of isolated information, but rather it's a JCR view of the information you already have in your environment: files systems...", I thought that I may have found my solution. I first prototyped a solution using 2.x connectors (file system for files and Infinispan for metadata), using references to tie them together. However, there were some complexities along the way. These were related to the references and the query index not getting updated when new files were created. In the end, the solution turned out to be almost as complex as storing and synchronizing the URLs to the files. Given that, and that it was not clear whether there was going to be any form of file system connectors and federation in 3.0, I backed off this approach.

                   

                  So I am back to the "store the URL in a property" approach, using observation to keep JCR and the file resources in synch. So far, this works OK, although quirks in the JCR observation API have been a challenge. For example, it seems that when you delete a node, you are only notified of that node's deletion, not its children. Therefore, if a node is deleted, and a child node contains a property referencing a file, there is no way to know to delete that file... I am still experimenting with the viability of this approach. If I cannot make it work, I'm back to synchronizing JCR and the files in the application layer.
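One possible workaround for the "only the parent's deletion is reported" case, sketched here under the assumption that the application records which node paths reference which files as it writes them. `FileReferenceIndex`, `track` and `nodeRemoved` are hypothetical names; the sketch deliberately does not use the JCR API, and an observation listener (not shown) would pass in the path from each removal event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper kept by the application; it does not call the JCR API.
final class FileReferenceIndex {
    // Node path -> file location referenced by a property on that node.
    private final TreeMap<String, String> fileByNodePath = new TreeMap<>();

    // Records that the node at nodePath references the given file.
    void track(String nodePath, String fileLocation) {
        fileByNodePath.put(nodePath, fileLocation);
    }

    // Removes and returns the files referenced by the node or any descendant,
    // so the caller can delete them even if only the parent was reported.
    List<String> nodeRemoved(String nodePath) {
        List<String> orphaned = new ArrayList<>();
        String self = fileByNodePath.remove(nodePath);
        if (self != null) {
            orphaned.add(self);
        }
        String prefix = nodePath.endsWith("/") ? nodePath : nodePath + "/";
        // Keys that start with prefix sort inside [prefix, prefix + '\uffff').
        Map<String, String> descendants =
                fileByNodePath.subMap(prefix, prefix + Character.MAX_VALUE);
        orphaned.addAll(descendants.values());
        descendants.clear(); // clearing the view removes the entries from the index
        return orphaned;
    }
}
```

The trailing "/" in the prefix keeps a sibling such as "/videos/ab" from being treated as a descendant of "/videos/a". The index itself would have to be persisted or rebuilt on startup, which this sketch leaves out.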

                  • 21. Re: Binary values in ModeShape 3
                    rhauch

                    ... although quirks in the JCR observation API have been a challenge. For example, it seems that when you delete a node, you are only notified of that node's deletion, not its children. Therefore, if a node is deleted, and a child node contains a property referencing a file, there is no way to know to delete that file...

                    ModeShape 3.0 behaves a little differently from 2.x: a NODE_REMOVED event will be generated for all nodes in the deleted subgraph, although I think the event for the parent node occurs before the events for the child nodes.
