6 Replies Latest reply on Dec 12, 2011 10:06 AM by galder.zamarreno

    Storing large collections in Infinispan

    ydewit

      I am looking for some recommendations on how best to store large collections (ordered/unordered) in Infinispan.

       

      I am basically trying to store a parent entity that can contain a large number of child entities, either ordered or not. The simple solution is to store the collection of child IDs in a collection (arraylist or hashset) with the parent entity so that when I get a specific parent entity I will also get the list of IDs to the children entities and can fetch the children individually as needed.

       

      The first problem with this approach is that any update to the collection would require updating the parent entity, including the whole collection. The natural solution here would be to detach the collection from its parent so that we have one cache entry for each parent entity, one separate cache entry for each parent-collection, and one cache entry for each child entity.

       

      The second problem is that the collection itself could be quite large. An ArrayList with, say, 1 million entries would take around 50 MB of memory, and updating this list would be a drag, to say the least, if the whole collection has to be rewritten in the cache. The natural solution here would be to split the collection into smaller chunks, effectively paginating it. A page size of 1 would be equivalent to having the parent ID stored in the child entity as a foreign key. The issue then becomes how to query all the child entities in the cache with a given parent ID and return only N entries starting at position P. AFAIK, this is not directly supported by the Infinispan APIs, but it can be done with something like Lucene on top of it providing the required indexing.
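
      The chunking idea above can be sketched in plain Java. This is a minimal illustration, not an Infinispan API: a ConcurrentHashMap stands in for the cache, and the class name PagedChildIds, the key format and the per-parent size counter are all assumptions made for the example. It handles appends only; removing an ID from the middle would mean shifting later pages, and a real cache would need a transaction or lock around each read-modify-write.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: keeps a parent's child IDs in fixed-size pages,
// one cache entry per page. The Maps stand in for an Infinispan cache.
final class PagedChildIds {
    private final Map<String, List<Long>> cache = new ConcurrentHashMap<>();
    private final Map<String, Integer> sizes = new ConcurrentHashMap<>();
    private final int pageSize;

    PagedChildIds(int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.pageSize = pageSize;
    }

    // Assumed key format: parent ID, a fixed marker, then the page number.
    private static String pageKey(String parentId, int page) {
        return parentId + "#children#" + page;
    }

    // Appends a child ID; only the last page entry is rewritten.
    void append(String parentId, long childId) {
        int size = sizes.getOrDefault(parentId, 0);
        String key = pageKey(parentId, size / pageSize);
        List<Long> page = new ArrayList<>(
                cache.getOrDefault(key, Collections.<Long>emptyList()));
        page.add(childId);
        cache.put(key, page);
        sizes.put(parentId, size + 1);
    }

    // Returns up to n child IDs starting at position p,
    // reading only the pages that overlap the requested range.
    List<Long> slice(String parentId, int p, int n) {
        int size = sizes.getOrDefault(parentId, 0);
        int start = Math.max(p, 0);
        int end = (int) Math.min((long) size, (long) start + Math.max(n, 0));
        List<Long> out = new ArrayList<>();
        for (int page = start / pageSize; page * pageSize < end; page++) {
            List<Long> ids = cache.getOrDefault(
                    pageKey(parentId, page), Collections.<Long>emptyList());
            int from = Math.max(start - page * pageSize, 0);
            int to = Math.min(end - page * pageSize, ids.size());
            if (from < to) {
                out.addAll(ids.subList(from, to));
            }
        }
        return out;
    }
}
```

      With a page size of 3 and ten appended IDs, slice(parentId, 2, 5) touches three page entries instead of one large list, which is the trade-off the page size controls.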

       

      1. The first question is whether something like Lucene is the only solution here or whether there are other options/recommendations. I came across something about AtomicMap, but I am not sure how it could help here (pointers appreciated if it is a viable option).

       

      2. And the second question is how reliable is Lucene's indexing in a clustered environment with Infinispan? My concern here stems from the fact that, in the event of an issue with the indexing/Lucene, querying the children of a given parent entity is much more critical to the app functionality than having full text search enabled in a drop-down in the UI.

       

      thanks in advance for any insights into this,

        • 1. Re: Storing large collections in Infinispan
          galder.zamarreno

          Judging from your use case, you might want to try Hibernate OGM rather than using Infinispan directly. Hibernate OGM offers a higher-level, JPA-like API for storing entities, and what's interesting for you is that it deals with the process of dehydrating entities and collections, saving you the headache of figuring out how to store these internally. Also, since these are dehydrated, it can send deltas as opposed to serializing the full collection. In fact, Hibernate OGM internally uses Infinispan Atomic Maps to do this.

           

          I believe querying support for Hibernate OGM is still work in progress though.

          • 2. Re: Storing large collections in Infinispan
            sannegrinovero

            As Galder says, with OGM we do map bi-directional relations between "entities" to Infinispan. You could use it directly or do the same yourself using AtomicMaps. "JPA-like" queries are not integrated yet, but integration with indexing via Hibernate Search works fine, in case you need query capabilities beyond following relations across your objects (OGM doesn't need any query engine to do that).

             

             

            1. The first question is whether something like Lucene is the only solution here or whether there are other options/recommendations. I came across something about AtomicMap, but I am not sure how it could help here (pointers appreciated if it is a viable option).

            Other options are AtomicMap, FineGrainedAtomicMap (which uses a separate lock for each entry), the Tree API, or Map/Reduce, which works pretty well for searching for elements.

            I'm sorry the AtomicMap is not explained in the docs, as it's trivial. Javadocs are very complete, they should be good; if you have more doubts please ask.

             

            2. And the second question is how reliable is Lucene's indexing in a clustered environment with Infinispan? My concern here stems from the fact that, in the event of an issue with the indexing/Lucene, querying the children of a given parent entity is much more critical to the app functionality than having full text search enabled in a drop-down in the UI.

            A Lucene index is not easy to corrupt, and if needed it can always be rebuilt from the values. Anyway, if you only need to query for this parent/child relation, Lucene is not your best choice; I'd suggest the AtomicMaps or the Tree API.

            • 3. Re: Storing large collections in Infinispan
              ydewit

              Galder and Sanne,

               

              Thanks for your replies! I will take a closer look at OGM in the coming weeks, but it will not be something I will be able to leverage in the short term.

               

              I am intrigued, though, by the idea of using an AtomicMap (FineGrainedAtomicMap), but I don't fully grasp how you are using this in OGM to represent relationships (bi-directional or otherwise).

               

              1. Are you basically storing the AtomicMap as a separate cache entry from the parent and the child entities?

              2. I would imagine that the cache entry key for the AtomicMap would be something like ParentID+RelationshipRole.

              3. Is the child ID stored as a key in the AtomicMap and the AtomicMap values unused?

              4. My guess is that ordered relationships are not supported without resorting to some child ID indexing (which would mean rewriting the IDs when children are added/removed).

              5. Does the AtomicMap help when the number of child entities is large? i.e. does it provide any lazy loading or paginated loading of Map entries? (I understand this should not be a problem for most cases since there should only be IDs there, but I am curious anyway.)

               

              thanks for your insights!

              • 4. Re: Storing large collections in Infinispan
                sannegrinovero

                Hi Yuri,

                Hibernate OGM remaps the relational model like Hibernate Core would do to a database, except that it uses a key/value store. So an entity is decomposed into primitives, to shield users from mutability of objects, serialization problems, class definition updates and schema changes. For relations, you could think of how bridge tables are used in a database to map many-to-many relations: using foreign keys. The tricky aspect of a key/value store is that it can't query in any way other than get(key) (a database would do a query using one of the FKs as a fixed known parameter), so both sides of the relation store the full set of IDs of the other side in a specific FGAM, like you say using parentID+Role to identify the key, but the values are actually the IDs of the other side.

                 

                Might be easier looking at the pictures in chap. 3.2 of the docs: http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/#ogm-architecture-datapersisted
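
                The association layout described in this reply can be sketched in plain Java. This is only an illustration under stated assumptions, not OGM's actual code: a ConcurrentHashMap stands in for the cache, a plain Set stands in for the FineGrainedAtomicMap, and the names AssociationKey, AssociationStore and the role strings "children" and "parent" are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache key for one side of a relation: owner ID plus role.
final class AssociationKey {
    final String ownerId;
    final String role;

    AssociationKey(String ownerId, String role) {
        this.ownerId = Objects.requireNonNull(ownerId);
        this.role = Objects.requireNonNull(role);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AssociationKey)) {
            return false;
        }
        AssociationKey other = (AssociationKey) o;
        return ownerId.equals(other.ownerId) && role.equals(other.role);
    }

    @Override
    public int hashCode() {
        return Objects.hash(ownerId, role);
    }
}

// Both sides of the relation keep the IDs of the other side, so either
// direction can be followed with a single get(key) and no query engine.
final class AssociationStore {
    // Stand-in for the cache; each value plays the role of one association entry.
    private final Map<AssociationKey, Set<String>> cache = new ConcurrentHashMap<>();

    void link(String parentId, String childId) {
        idsFor(new AssociationKey(parentId, "children")).add(childId);
        idsFor(new AssociationKey(childId, "parent")).add(parentId);
    }

    void unlink(String parentId, String childId) {
        idsFor(new AssociationKey(parentId, "children")).remove(childId);
        idsFor(new AssociationKey(childId, "parent")).remove(parentId);
    }

    // Returns a sorted copy so callers cannot mutate the stored entry.
    Set<String> childrenOf(String parentId) {
        return copyOf(new AssociationKey(parentId, "children"));
    }

    Set<String> parentsOf(String childId) {
        return copyOf(new AssociationKey(childId, "parent"));
    }

    private Set<String> idsFor(AssociationKey key) {
        return cache.computeIfAbsent(key, k -> ConcurrentHashMap.<String>newKeySet());
    }

    private Set<String> copyOf(AssociationKey key) {
        return new TreeSet<>(cache.getOrDefault(key, Collections.<String>emptySet()));
    }
}
```

                Note that every link or unlink touches two entries, one per side, which is the price of being able to navigate the relation from either end by key alone.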

                 

                4: yes. this might be tricky, hence the reuse of Hibernate Core's engine which is rock solid.

                 

                5: almost: all children are downloaded at least once, as it's still considered a single value by Infinispan: so it does not provide lazy loading, but it makes sure only the deltas are rewritten minimizing subsequent network load. Seems to work pretty well as this strategy minimizes the amount of network packets.
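
                The "only the deltas are rewritten" behaviour can be pictured with a small plain-Java sketch. This is not how Infinispan implements AtomicMap; the DeltaMap and Change names are hypothetical, and the example only shows the general idea of recording changes locally and shipping just those changes to another copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical illustration of delta replication for a map-valued entry.
final class DeltaMap {
    // One recorded change: a put (value != null) or a remove (value == null).
    static final class Change {
        final String key;
        final String value;

        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> state = new HashMap<>();
    private final List<Change> pending = new ArrayList<>();

    void put(String key, String value) {
        Objects.requireNonNull(value, "null is reserved to mark removals");
        state.put(key, value);
        pending.add(new Change(key, value));
    }

    void remove(String key) {
        state.remove(key);
        pending.add(new Change(key, null));
    }

    // Returns the changes made since the last call and clears the log.
    // This list, not the whole map, is what would travel over the network.
    List<Change> drainDelta() {
        List<Change> delta = new ArrayList<>(pending);
        pending.clear();
        return delta;
    }

    // Applies a delta received from another copy without re-recording it.
    void apply(List<Change> delta) {
        for (Change c : delta) {
            if (c.value == null) {
                state.remove(c.key);
            } else {
                state.put(c.key, c.value);
            }
        }
    }

    // Sorted copy of the current contents, for inspection.
    Map<String, String> snapshot() {
        return new TreeMap<>(state);
    }
}
```

                After the initial transfer, removing one child from a map of a million IDs produces a delta with a single Change, so the cost of an update follows the size of the change rather than the size of the collection.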

                • 5. Re: Storing large collections in Infinispan
                  ydewit

                  Sanne,

                   

                  That documentation was quite helpful!

                   

                  Looking at the 'storing associations' picture, I see the first 4 rows in the table (in the picture) are entities and the next 4 rows are relationships. I guess the AtomicMap is actually the value of the last 4 rows, right? And the values in the first 4 rows are simple tuples containing the entity properties. Very clever! The association/relationship is where the AtomicMap shines, I guess.

                   

                  thanks for the insight.

                  • 6. Re: Storing large collections in Infinispan
                    galder.zamarreno

                    @Yuri, for more info on how OGM uses atomic maps:

                     

                    1. git clone https://github.com/hibernate/hibernate-ogm.git

                     

                    2. Open it with your favourite editor