13 Replies Latest reply on Sep 28, 2015 4:16 AM by sberthez

    ModeShape (4.4.0) local MapDB index seems not working with big data

    sberthez

      I am trying to solve a problem that occurs when a repository holds many thousands of custom nodes (nearly 300 000 for now) with specific properties that I want to use in queries. I have declared an index and it appears in the explain plan of the query. For now, I can only use it when I have few nodes (for example 10 000 nodes); when I have 300 000 nodes everything is stuck. I can clearly see that the files used by the local MapDB are not growing (always 7MB), even though they do grow for my other dataset with fewer nodes.

       

      What could be the problem? Is there any way to verify what is going on and whether the background indexer is working? Is there any limit on the index size or the number of nodes? When you have many thousands of nodes to query, what is the best strategy?

       

      Is it also possible not to reset the MapDB database on restart?

       

      Is there any documentation on implementing our own index provider? I use a JDBC connector for Infinispan persistence and I would also like to use my DB to manage indexes.

       

      Regards

        • 1. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
          hchiorean

          What could be the problem ? Is there any way to verify what is going on and if background indexer is working ? Is there any limit to the index size or number of nodes ?

          It's not something we've seen before, so you'll have to investigate/debug your own system. There aren't any limits as far as sizes of indexes or number of nodes stored in an index. It may be a MapDB issue or something else....

          When you have many thousands of nodes you want to query, what is the best strategy ?

          The only strategy I can think of is making sure you've defined indexes for those properties/node types which you frequently query for. This should give you better performance (search time).

          Is it also possible to not reset the MapDB base on restart ?

          What do you mean? If the DB is present it will merely be opened and each stored index will simply read the existing collections from this DB. For an example, see LocalMapIndex.java on the master branch of the ModeShape/modeshape GitHub repository.

          Is there any documentation to implement our own index provider ? I use a JDBC connector for Infinispan persistence and i would like to use my DB also to manage indexes.

          Not really. You have to look at how the local indexes are implemented. The API is quite complex unfortunately. The package you should look at is modeshape-jcr/src/main/java/org/modeshape/jcr/index/local on the master branch of the ModeShape/modeshape GitHub repository.

          • 2. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
            sberthez

            It seems I have an inconsistency in my index definitions. I used the JSON configuration file to define indexes, and those indexes do not correspond to the definitions stored in the workspace. I did not know that some information related to index definitions is also stored under "/jcr:system/mode:indexes/local" and that modifying the JSON file does not unregister the previous indexes. I will unregister them and retry.

             

            In general, do you think that the number of declared indexes could be the problem?

            • 3. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
              hchiorean

              I used the JSON configuration file to define indexes, and those indexes do not correspond to the definitions stored in the workspace. I did not know that some information related to index definitions is also stored under "/jcr:system/mode:indexes/local" and that modifying the JSON file does not unregister the previous indexes. I will unregister them and retry.

              You can define indexes in 2 different ways: via the repository JSON config or via the IndexManager API (see IndexManager.java on the master branch of the ModeShape/modeshape GitHub repository). As you're "stabilizing" your node schema, it's normal that you'll change some definitions (e.g. change a certain index definition's type). Each time you make these changes to existing index definitions, you should make sure you remove previously stored data for those indexes (either by removing the MapDB data stored on disk or using the aforementioned API).

              Once you've finalized your schema however, you shouldn't keep changing your index definitions - you can always add/remove index definitions - but changing existing ones is not recommended.

               

              In general, do you think that the number of declared indexes could be the problem?

              Index definitions are stored as JCR nodes in the system area. Local indexes on the other hand (the materialization of those definitions) are stored as MapDB collections. In either case, there isn't any built-in limit so you can define as many as you want. However, if you "optimize" your node schema, it's unlikely that you'll have to define a large number of indexes. The "Query and search" page of the ModeShape 4 documentation has a simple example which shows how you can "evolve" your node schema to optimize index definitions.

              • 4. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                sberthez

                Little by little I understand better how it works and what the constraints are. I am currently adapting the design accordingly. My new challenge will be to avoid path traversal during my queries.

                 

                For example, I now have the following design:

                /storedoc/<folder level 1>/<folder level 2>/ [nodes of type my:doc]

                I have 300 000 nodes of type "my:doc", and the folder levels are computed from a split hash to distribute documents across these folders and avoid a huge child count.
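
A minimal sketch of how such a split-hash folder path could be computed. The post does not say which hash or how many characters per level are used, so everything here is an assumption for illustration: the class name `DocPath`, the use of SHA-1, the document identifier as hash input, and one digest byte (two hex characters) per folder level.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a two-level folder path from a hash of the document id. */
public final class DocPath {

    private DocPath() {
    }

    /**
     * Returns "/storedoc/<level1>/<level2>", where each level is one byte of the
     * SHA-1 digest of the document id, written as two lowercase hex characters.
     * SHA-1 and the two-characters-per-level split are assumptions, not the
     * original poster's actual scheme.
     */
    public static String folderPathFor(String docId) {
        byte[] digest = sha1(docId);
        String level1 = String.format("%02x", digest[0] & 0xff);
        String level2 = String.format("%02x", digest[1] & 0xff);
        return "/storedoc/" + level1 + "/" + level2;
    }

    private static byte[] sha1(String value) {
        try {
            return MessageDigest.getInstance("SHA-1")
                                .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
    }
}
```

With one byte per level there are 256 x 256 = 65 536 possible leaf folders, so 300 000 documents would average fewer than five children per leaf folder, and no folder has more than 256 sub-folders.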

                 

                When I run the query "select ... from [my:doc] where ...", the ModeShape/JCR engine seems to traverse all possible nodes to search for any instance of "my:doc". Is there any index I could create to avoid this traversal? For example, an index on the node type or mixin. Is it possible? To be honest, I am really surprised there is not already such an index behind the scenes by default.

                • 5. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                  hchiorean

                  When I run the query "select ... from [my:doc] where ...", the ModeShape/JCR engine seems to traverse all possible nodes to search for any instance of "my:doc".

                  Did you try defining a node-type index on [my:doc] ?

                  To be honest, I am really surprised there is not already such an index behind the scenes by default.

                  There aren't any indexes behind the scenes. They are fully optional - you either define them or you don't - in which case the repository will fall back to loading all the nodes in memory and checking the query criteria against them.

                  • 6. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                    sberthez

                    Did you try defining a node-type index on [my:doc] ?

                    How can you do that? I mean, I can define a column index dedicated to a specific node type and property (column), but it will not be used for other properties:

                    Example:

                    "IndexByStatus" : {
                        "DocumentByStatus" : {
                            "kind" : "value",
                            "nodeType" : "my:doc",
                            "columns" : "my:status(STRING)",
                            "provider" : "local"
                        }
                    }

                     

                    In the query guide I have seen that "kind" can be "NodeType", which is supposed to be "An index specialized to track the node types for nodes". How can I declare it?

                     

                    There aren't any indexes behind the scenes. They are fully optional - you either define them or you don't - in which case the repository will fall back to loading all the nodes in memory and checking the query criteria against them

                    In my case, I am sure everything is in memory, but queries are still very slow.

                    • 7. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                      hchiorean

                      You can see the different types of indexes in repo-config-local-provider-and-notional-indexes.json on the master branch of the ModeShape/modeshape GitHub repository.

                       

                      But in your case, if you've defined a value index on [my:doc], the query engine should use this index whenever you're querying for nodes of type [my:doc], so you don't necessarily need to define a type-based index. Whatever performance problems you're seeing when querying are not coming from ModeShape looking at all the nodes in the repository.

                      • 8. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                        sberthez

                        I am not sure I clearly understand what you mean. If the index on the node type returns all 300K node references, the engine does not have to scan other nodes. I hoped that a single in-memory scan of 300K nodes could be fast. Do you mean it will never work as I expected? I hoped to get at least performance similar to a NoSQL store such as MongoDB with 300K documents stored and without any index. I mean, ModeShape works in memory and all the nodes are there.

                         

                        There is also some strange behavior: when I run the explain plan of the query, I can clearly see that the index I have declared will be used. But when I check the MapDB files on the file system, they are very small compared to my other sample with only 30K nodes. When I kill the program because the query has not responded after 30 min, I can see the MapDB file grow (hundreds of MB) and then go back to a small size. I have the feeling that indexes are computed but fail to be written to the file system when I have 300K nodes. Are you aware of such an issue? Any idea? The index was declared after all 300K nodes had been created, so it is supposed to be built on startup.

                         

                        I will try another batch to create 300K nodes, but this time I will declare the index before creation. Perhaps the index will grow correctly in this case.

                        • 9. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                          hchiorean

                          The query engine will only load the nodes that are returned by the index. So if the index returns 300k node keys (i.e. node ids), ModeShape will load each of the 300k nodes in batches, in memory, as you're reading the query results.

                          I hoped to get at least performance similar to a NoSQL store such as MongoDB with 300K documents stored and without any index. I mean, ModeShape works in memory and all the nodes are there.

                          ModeShape is a JCR spec implementation, not a NoSQL database. This means that most of the time, there is more processing involved and most likely a larger memory footprint per unit of storage. So this assumption is not reasonable IMO.

                          • 10. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                            sberthez

                            The query engine will only load the nodes that are returned by the index. So if the index returns 300k node keys (i.e. node ids), ModeShape will load each of the 300k nodes in batches, in memory, as you're reading the query results.

                            Not really. I used the preload setting and I have verified with Infinispan that all nodes are in memory, so ModeShape is not supposed to load anything.

                             

                            There is also some strange behavior: when I run the explain plan of the query, I can clearly see that the index I have declared will be used. But when I check the MapDB files on the file system, they are very small compared to my other sample with only 30K nodes. When I kill the program because the query has not responded after 30 min, I can see the MapDB file grow (hundreds of MB) and then go back to a small size. I have the feeling that indexes are computed but fail to be written to the file system when I have 300K nodes. Are you aware of such an issue? Any idea? The index was declared after all 300K nodes had been created, so it is supposed to be built on startup.

                             

                            I will try another batch to create 300K nodes, but this time I will declare the index before creation. Perhaps the index will grow correctly in this case.

                            ModeShape is a JCR spec implementation, not a NoSQL database. This means that most of the time, there is more processing involved and most likely a larger memory footprint per unit of storage. So this assumption is not reasonable IMO.

                            It will never be as fast, I understand that, but loading a node that is already in memory and checking a single value is not supposed to be so slow. Perhaps something is not working correctly.

                            • 11. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                              sberthez

                              Back after some more tests. I have loaded another batch of data, limited to 100K nodes and distributed across folders to limit the number of children per node. This time I declared the index before loading:

                               

                              "DocumentByType" : {
                                      "kind" : "value",
                                      "nodeType" : "nt:base",
                                      "columns" : "jcr:primaryType(NAME)",      
                                      "provider" : "local"
                                 }

                              It is working now: the MapDB file is around 250MB and the explain plan shows that the index is used:

                               

                                    Index [doc] <INDEX_SPECIFICATION=DocumentByType, provider=local, cost~=100, cardinality~=100004, selectivity~=0.3318412, constraints=[doc.[jcr:primaryType] = 'my:document']>

                               

                              I have tuned Infinispan to make sure all nodes are in memory, so the ModeShape engine is not supposed to read from the persistent cache store during a query; everything is local or in memory. The good news is that queries seem to run fast:

                               

                              select doc.[jcr:uuid] from [my:document] as doc where doc.[my:buisness_unit]=1

                              Found count :33131

                              Query execution time :0.247s

                              33131 items found in 0.2s is really good, knowing that "my:buisness_unit" is not indexed. Adding another index for the property in the "where" clause does not change the execution time:

                               

                                    Index [doc] <INDEX_SPECIFICATION=DocumentByStatus, provider=local, cost~=100, cardinality~=33119, selectivity~=0.33117014, constraints=[doc.[my:status] = 'Draft']>
                                    Index [doc] <INDEX_SPECIFICATION=DocumentByType, provider=local, cost~=100, cardinality~=100005, selectivity~=0.33230326, constraints=[doc.[jcr:primaryType] = 'my:document']>

                               

                              Running :select doc.[jcr:uuid] from [my:document] as doc where doc.[my:status]='Approved'

                              Found count :33684

                              Query execution time :0.274s

                               

                              So basically, the index on node types has removed the traversal and everything seems to be quick, provided Infinispan is tuned correctly to keep everything in memory. I now need to test with 300K nodes, but I am still facing issues during index construction. I think the Infinispan cache's max entries count for eviction is not big enough; I will continue to investigate.

                               

                              Also, the index builder seems to be asynchronous. Is there any way to build the index on startup and not continue until the index is finished? Are there any settings I could tune to reduce the index build time?

                               

                              Regards

                              • 12. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                                hchiorean

                                There is no way to make reindexing synchronous in 4.4. It's something which we've just added in master (see the "Query and search" page of the ModeShape 4 documentation) and will be available from 4.5 onwards. In the meantime, one possible workaround is to read the status of an index (see IndexManager.java at the modeshape-4.4.0.Final tag of the ModeShape/modeshape GitHub repository) and block/sleep while it's reindexing.

                                 

                                Index updates however (i.e. adding data to indexes after a session.save call) are synchronous by default.
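
A minimal sketch of the block/sleep workaround, kept free of ModeShape dependencies so it can be read on its own. The class `IndexWait`, its method name, and the `"REINDEXING"` status string are assumptions for illustration; the status supplier stands in for whatever call reads the index status from the IndexManager, and the actual status values should be checked against the IndexManager source for your version.

```java
import java.util.function.Supplier;

/** Hypothetical helper: blocks the caller while an index reports that it is reindexing. */
public final class IndexWait {

    private IndexWait() {
    }

    /**
     * Polls the supplied status until it is no longer "REINDEXING" or the timeout elapses.
     *
     * @param statusSupplier returns the current index status as a string; in a real
     *                       application this would wrap the IndexManager status lookup
     *                       (assumption: the reindexing state is named "REINDEXING")
     * @param pollMillis     pause between two status checks
     * @param timeoutMillis  maximum total time to wait
     * @return true if reindexing finished in time, false on timeout or interruption
     */
    public static boolean awaitReindexed(Supplier<String> statusSupplier,
                                         long pollMillis,
                                         long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while ("REINDEXING".equals(statusSupplier.get())) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still reindexing when the timeout expired
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

Calling this once after repository startup, before issuing the first query, would approximate synchronous reindexing; the timeout keeps a stalled rebuild from blocking the application forever.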

                                • 13. Re: ModeShape (4.4.0) local MapDB index seems not working with big data
                                  sberthez

                                  I did not know about IndexManager.getIndexStatus(), thanks for the information. It could be enough. I still need to understand why indexing does not seem to finish with 300K nodes; I need to run more tests.