10 Replies Latest reply on Aug 31, 2015 10:09 AM by sberthez

    How does index persistence work in ModeShape ?

    sberthez

      Hi,

       

      I have configured the Infinispan cache used by my ModeShape 4.0 repository to use soft-index-file-store for data persistence:

       

      <persistence>

          <soft-index-file-store xmlns="urn:infinispan:config:soft-index:7.0">

              <index path="c:/temp/modeshape/cache/_master/index" />

              <data path="c:/temp/modeshape/cache/_master/data" />

          </soft-index-file-store>

      </persistence>

       

      It works fine for storing and loading data, but the indexes are always deleted on shutdown and so cannot be reused on the next start. Is there a way to keep the indexes persistent?

       

      Another question, not sure it is related: when I upload a binary file and the Tika parser parses the binary content, how is the resulting full-text index stored? When I restart the ModeShape server, are the files scanned and parsed again, or is a persisted result used?

       

      Regards

        • 1. Re: How does index persistence work in ModeShape ?
          hchiorean

          ModeShape indexes have nothing to do with Infinispan indexes. ModeShape does not use Infinispan's query feature in any way, shape or form.

           

          Instead, ModeShape's query & indexing functionality is built in, and in 4.x you can still execute queries without any indexes defined. Indexing in ModeShape 4.x is mainly a way to optimize query execution: instead of all the repository data, only a certain subset is searched and matched against your query criteria. You can read more about indexing and querying in our documentation: Query and search - ModeShape 4. Right now, the only index provider in ModeShape 4 (the local index provider) uses MapDB to store indexes on disk, separate from any repository data.

          Another question, not sure it is related: when I upload a binary file and the Tika parser parses the binary content, how is the resulting full-text index stored? When I restart the ModeShape server, are the files scanned and parsed again, or is a persisted result used?

          The local index provider (MapDB based) does not support (and as such does not store) FTS. Currently, FTS works as follows: if text extraction is enabled, text is extracted from binary values and stored in the binary store. When the query engine encounters FTS criteria, it simply asks the binary store for the stored text and matches it against the search criteria.

          • 2. Re: How does index persistence work in ModeShape ?
            sberthez

            Thanks for your answer. I made a mistake, you are right: the index I saw corresponds to the index data used by the Infinispan soft-index cache store. It has nothing to do with ModeShape.

            When you use the local index provider for ModeShape, does it mean this data cannot be replicated across the cluster? I suppose it cannot be.

            • 3. Re: How does index persistence work in ModeShape ?
              hchiorean

              Even though indexes are stored locally on each node, they are replicated based on remote events that are processed across the cluster (on top of JGroups). You can read more about clustered indexes in this thread: Re: How to keep cluster indexes synchronized

              • 4. Re: How does index persistence work in ModeShape ?
                sberthez

                Thanks, your link helps a lot for the part related to data indexes.

                 

                The local index provider (MapDB based) does not support (and as such does not store) FTS. Currently, FTS works as follows: if text extraction is enabled, text is extracted from binary values and stored in the binary store. When the query engine encounters FTS criteria, it simply asks the binary store for the stored text and matches it against the search criteria.

                 

                I am not sure I clearly understand what you mean by FTS, binary store and extractors. For example, I use the FileSystemBinaryStore, so my binary content (files in my case) is stored on the filesystem. I have also plugged in the tikaExtractor to enable text extraction. What I do not understand is where the extraction result is stored. Do you mean the tikaExtractor uses the same FileSystemBinaryStore that is configured as the binary store? If I use another binary store, for example DatabaseBinaryStore, will the tikaExtractor use that instead? Basically, does the way extracted text is stored depend on the binary store type you use? Is it possible to separate how binaries are stored from how extracted text is stored? For example, use the filesystem to store binary values but use an Infinispan store for the extracted text in order to speed up FTS queries?

                 

                Also, can you please clarify how FTS queries work? For example, when you search for a simple word without any other predicates, does the engine load the extracted text of all nodes one by one and then search it, or are there indexes somewhere? Does it rely on a search engine used by the binary store (for example Windows full-text search when files are stored on a Windows OS)? Basically, I am wondering how FTS performs when FileSystemBinaryStore is used and I have a lot of binary file nodes. Do you think it would be better to use another kind of binary store for better FTS performance?

                 

                Regards

                • 5. Re: How does index persistence work in ModeShape ?
                  hchiorean

                  I am not sure I clearly understand what you mean by FTS, binary store and extractors.

                  By FTS I mean full text search. For binary stores and extractors in general, I suggest you read the ModeShape documentation: https://docs.jboss.org/author/display/MODE40/Home. Full text search in general has nothing to do with binary stores or text extractors specifically; it is something that pertains to JCR queries (as defined by the spec).

                   

                  ModeShape offers text extraction as an additional feature (the JCR spec has no such thing), and it relates only to binary values. In other words, if you enable text extraction, ModeShape will use Apache Tika whenever you create a binary property in the repository: it will attempt to read that binary property and extract text content from it, which it will store in addition to the binary value in the same binary store. Once the extracted text is stored, it can be queried using FTS queries on those binary properties (see JCR-SQL2 - ModeShape 4). This means that for text extraction to work, you should have a binary store configured and the extracted text is always stored in the same place as the binary values. Query-performance wise a local store (FS / ISPN) will be faster than a remote one (a DB across the network).
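To make the shape of such an FTS query concrete, here is a minimal sketch in Java that only builds a JCR-SQL2 statement using CONTAINS. The class `FtsQueries`, the method `containsQuery` and the selector name `n` are hypothetical names chosen for illustration; `nt:resource` and `jcr:data` are the standard JCR names for file content. It assumes the search expression only needs its single quotes doubled to become a valid string literal.

```java
// Hypothetical helper, not part of ModeShape or the JCR API.
final class FtsQueries {
    private FtsQueries() {}

    /**
     * Builds a JCR-SQL2 statement that applies CONTAINS to one property,
     * or to all properties when property is "*".
     */
    static String containsQuery(String nodeType, String property, String searchExpression) {
        // Single quotes inside a JCR-SQL2 string literal are escaped by doubling them.
        String literal = searchExpression.replace("'", "''");
        // Property names containing ':' are wrapped in square brackets.
        String target = property.equals("*") ? "n.*" : "n.[" + property + "]";
        return "SELECT * FROM [" + nodeType + "] AS n WHERE CONTAINS("
                + target + ", '" + literal + "')";
    }
}
```

The resulting string would typically be passed to `QueryManager.createQuery(statement, Query.JCR_SQL2)` from the `javax.jcr.query` API and then executed.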

                  Also, can you please clarify how FTS queries work? For example, when you search for a simple word without any other predicates, does the engine load the extracted text of all nodes one by one and then search it, or are there indexes somewhere?

                  ModeShape 3 used Apache Lucene for performing full text search queries, while at the moment ModeShape 4 uses simple regex matching. When you perform an FTS query, according to the JCR spec, you must specify a selector and either a property name or *, meaning all properties. Behind the scenes, if the property is a binary property, ModeShape will ask the binary store for the extracted text of only that binary property and, if one doesn't yet exist, it will extract & store that text. The extracted text value is then used against the CONTAINS function.
                  I recommend you create a test case/example locally that mimics your use case and test the performance yourself. It is highly dependent on your node types, the actual queries being executed, etc., and not so much on the type of binary store.
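The extract-on-first-query behaviour described above can be sketched in plain Java. This is an illustration only, not ModeShape's actual binary store code: `LazyTextCache`, its methods and the in-memory map are hypothetical, the extractor function stands in for Tika, and a case-insensitive substring check stands in for the real matching.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for "binary store that also keeps extracted text".
final class LazyTextCache {
    private final Map<String, String> extractedText = new ConcurrentHashMap<>();
    private final AtomicInteger extractions = new AtomicInteger();
    private final Function<byte[], String> extractor; // stands in for a Tika-based extractor

    LazyTextCache(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Returns the stored text, extracting and storing it on first access only. */
    String textFor(String binaryKey, byte[] content) {
        return extractedText.computeIfAbsent(binaryKey, key -> {
            extractions.incrementAndGet();
            return extractor.apply(content);
        });
    }

    /** Simplified CONTAINS: case-insensitive substring match on the extracted text. */
    boolean contains(String binaryKey, byte[] content, String term) {
        String text = textFor(binaryKey, content).toLowerCase(Locale.ROOT);
        return text.contains(term.toLowerCase(Locale.ROOT));
    }

    /** Number of times the extractor actually ran. */
    int extractionCount() {
        return extractions.get();
    }
}
```

In this sketch only the first query against a given binary key pays the extraction cost; later queries reuse the stored text.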
                  • 6. Re: How does index persistence work in ModeShape ?
                    sberthez

                    Thanks a lot, it is clearer now.

                     

                    extracted text is always stored in the same place as the binary values. Query-performance wise a local store (FS / ISPN) will be faster than a remote one (a DB across the network).

                    That is the part I missed; I understand now.

                     

                    ModeShape 3 used Apache Lucene for performing full text search queries, while at the moment ModeShape 4 uses simple regex matching. When you perform an FTS query, according to the JCR spec, you must specify a selector and either a property name or *, meaning all properties. Behind the scenes, if the property is a binary property, ModeShape will ask the binary store for the extracted text of only that binary property and, if one doesn't yet exist, it will extract & store that text. The extracted text value is then used against the CONTAINS function.

                    Yes, I have done more tests and analyzed the code of FileSystemBinaryStore. I now understand better how the binary key is computed and where the extracted text is stored. I can also see extraction being triggered and the result stored upon search. This is not very performance friendly, because the first user who accesses a specific binary value will face a slowdown caused by extraction and storage. It could be interesting to have an asynchronous process that automatically extracts and stores text once a binary is saved. Is that something we can configure? Otherwise, I suppose it is something we can do in ModeShape using listeners. If not, I will force extraction with a job that asynchronously queries new/updated binary nodes.

                    • 7. Re: How does index persistence work in ModeShape ?
                      hchiorean

                      This is not very performance friendly, because the first user who accesses a specific binary value will face a slowdown caused by extraction and storage. It could be interesting to have an asynchronous process that automatically extracts and stores text once a binary is saved. Is that something we can configure? Otherwise, I suppose it is something we can do in ModeShape using listeners. If not, I will force extraction with a job that asynchronously queries new/updated binary nodes.

                      Technically, text extraction is performed asynchronously, but the "main" and extractor threads are synchronized: https://github.com/ModeShape/modeshape/blob/master/modeshape-jcr/src/main/java/org/modeshape/jcr/value/binary/AbstractBinaryStore.java#L137
                      and https://github.com/ModeShape/modeshape/blob/master/modeshape-jcr/src/main/java/org/modeshape/jcr/TextExtractors.java#L115

                      Also, text extraction *is not* performed whenever a binary is uploaded, but only the first time someone queries that binary via FTS (for obvious performance reasons, there is no point in preemptively extracting text and filling a binary store if a user never wants to perform FTS). As such, "true" asynchronous extraction would never work because query results have to be consistent.

                      • 8. Re: How does index persistence work in ModeShape ?
                        sberthez

                        Also, text extraction *is not* performed whenever a binary is uploaded, but only the first time someone queries that binary via FTS (for obvious performance reasons, there is no point in preemptively extracting text and filling a binary store if a user never wants to perform FTS). As such, "true" asynchronous extraction would never work because query results have to be consistent.

                         

                        I would say the opposite. If users wait for results because the engine needs to extract text the first time, it is not really a performance benefit from the user's point of view. Basically, asynchronous full-text extraction is what most ECM products I know do (e.g. EMC Documentum and its Full Text server). Full text consistency is not a problem, users usually accept there is a time to wait for data to be ingested into the database. But I can also understand that it is a choice made to avoid overloading the ModeShape engine, no problem. It is also a consequence of the choice to have text extraction done on the same server. It really depends on the application type and on whether your binary values are updated often. For example, if your application is mostly read-oriented, you expect files to be fully integrated as soon as they are imported. It could be interesting to be able to configure a list of binary properties to extract directly (asynchronously), in the same way as you can configure indexed properties. Anyway, I can deal with that: I will develop my own asynchronous request to force text extraction, or use listeners. Thanks.
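The kind of background job mentioned here could look roughly like the following Java sketch. It is only an illustration under stated assumptions: `ExtractionWarmer`, `onBinarySaved` and the in-memory map are hypothetical and do not use any ModeShape API; a real version would be triggered by a repository listener and would warm the binary store through an FTS query.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical pre-warming job: extracts text in the background after a save.
final class ExtractionWarmer implements AutoCloseable {
    private final Map<String, String> extractedText = new ConcurrentHashMap<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<byte[], String> extractor; // stands in for Tika

    ExtractionWarmer(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Called when a binary is saved; schedules extraction without blocking the caller. */
    CompletableFuture<String> onBinarySaved(String binaryKey, byte[] content) {
        return CompletableFuture.supplyAsync(() -> extractOnce(binaryKey, content), worker);
    }

    /** Query path: reuses warmed text, or extracts on demand if the job has not run yet. */
    String textFor(String binaryKey, byte[] content) {
        return extractOnce(binaryKey, content);
    }

    /** True once the text for this key has been extracted and stored. */
    boolean isWarm(String binaryKey) {
        return extractedText.containsKey(binaryKey);
    }

    private String extractOnce(String binaryKey, byte[] content) {
        return extractedText.computeIfAbsent(binaryKey, key -> extractor.apply(content));
    }

    @Override
    public void close() {
        worker.shutdown(); // lets already queued extractions finish
    }
}
```

Because the query path falls back to on-demand extraction, a search issued before the background job finishes still sees the text; the job only moves the cost away from the first searcher in the common case.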

                        • 9. Re: How does index persistence work in ModeShape ?
                          hchiorean

                          Full text consistency is not a problem, users usually accept there is a time to wait for data to be ingested into the database.

                          I have to disagree: full text is nothing more than a feature defined as part of JCR query (via the CONTAINS function) together with all the other JCR-SQL functions. As such, you can't have a query that returns a set of results once and then on subsequent runs another set of results, based on "async stuff" going on in the background, provided none of the repository data - and by that I mean nodes & properties - changes.

                          • 10. Re: How does index persistence work in ModeShape ?
                            sberthez

                            I have to disagree: full text is nothing more than a feature defined as part of JCR query (via the CONTAINS function) together with all the other JCR-SQL functions. As such, you can't have a query that returns a set of results once and then on subsequent runs another set of results, based on "async stuff" going on in the background, provided none of the repository data - and by that I mean nodes & properties - changes.

                             

                            I have many years of experience with the EMC Documentum engine, which is one of the leaders in the ECM market, and I can guarantee that a delay in ingesting data is not a problem at all for users. But they are not happy when they need to wait for a search result... In such systems, you have a dedicated server (a dedicated Lucene server, for example) to generate full-text data and respond to full-text requests. Full-text indexing is asynchronous and managed by a queue, so for a short lapse of time, full-text requests do not return your files. It has never been a problem and it is commonly accepted.

                             

                            Anyway, it is not a problem. I can completely understand ModeShape's choice to extract on demand, because in this case the work is done locally and not by a dedicated server. In that case it seems to be the right choice, to avoid overloading the server. I think I will use a dedicated ModeShape cluster for extraction work in order to simulate a kind of "fulltext server" and have text extracted/stored before anyone searches. But I will need to use a shared binary store.

                             

                            Once again, thanks for your great help. Another point of view is always interesting to read.