29 Replies Latest reply on Feb 20, 2013 9:39 AM by rhauch
      • 15. Re: Error while trying to setup Tika text extractor in modeshape
        satyakishor.m

        I am planning to add a custom extractor to fix the problem. I saw your article about adding custom extractor but I didn't find the exact configuration details to use in the standalone.xml file. Can you please help me on this.

        • 16. Re: Error while trying to setup Tika text extractor in modeshape
          rhauch

          A custom extractor is set up just like any other extractor, with a few differences:

           

          1. The built-in extractors can use a short "alias" for the classname (the full classname also works), but custom extractors must specify the full classname.
          2. Any extractor-specific fields (if there are any) can be set as attributes on the element.
          3. Create a JBoss AS7 module for your extractor library. You can use the "modules/org/modeshape/extractor/tika" module as an example, but there are quite a few examples in the AS7 installation and information online.

           

          Again, the only difference between configuring built-in and custom extractors is #1 above. Other than that, they're handled exactly the same way.

           

          For example, here's the example we provide in our "standalone-modeshape.xml":

           

          <text-extractor name="tika-extractor" classname="tika" module="org.modeshape.extractor.tika"/>

           

          Here's what a custom extractor "com.example.FooBarExtractor" would look like (if no fields had to be set):

           

          <text-extractor name="foo-extractor" classname="com.example.FooBarExtractor" module="org.modeshape.extractor.tika"/>

          • 17. Re: Error while trying to setup Tika text extractor in modeshape
            satyakishor.m

            After investigating this issue further, I also found that there were two versions of the Apache POI jars (one from the Apache Tika module and one from the application). I removed all the dom4j and Apache POI dependencies from the application and included the Apache Tika module in jboss-deployment-structure.xml as shown below.

             

            <jboss-deployment-structure>
                <deployment>
                    <dependencies>
                        <module name="org.apache.tika" slot="1.2" services="import"/>
                    </dependencies>
                </deployment>
            </jboss-deployment-structure>

             

            Now I no longer run into the class cast exception. The Tika extractor also extracts the correct text from the spreadsheet, and the search returns the correct results.

            • 18. Re: Error while trying to setup Tika text extractor in modeshape
              rhauch

              Fantastic!

              • 19. Re: Error while trying to setup Tika text extractor in modeshape
                satyakishor.m

                Randall,

                Thank you very much for your help on this issue. Now I am getting comfortable with JBoss and ModeShape.

                • 20. Re: Error while trying to setup Tika text extractor in modeshape
                  satyakishor.m

                  Currently I am working on the index rebuild feature in ModeShape using ((org.modeshape.jcr.api.Workspace)session.getWorkspace()).reindexAsync("/"). I tested the reindex feature and it looks like the rebuild doesn't extract the contents from the attachments using the Tika extractor.

                  Is this the correct behaviour?

                  • 21. Re: Error while trying to setup Tika text extractor in modeshape
                    rhauch

                    Currently I am working on the index rebuild feature in ModeShape using ((org.modeshape.jcr.api.Workspace)session.getWorkspace()).reindexAsync("/"). I tested the reindex feature and it looks like the rebuild doesn't extract the contents from the attachments using the Tika extractor.

                    Is this the correct behaviour?

                     

                    What do you mean by "it looks like rebuild doesn't extract the contents ..."? Are you looking for debug messages in the extraction process?

                     

                    The short answer is that ModeShape does not extract the text from the attachments (any BINARY values) during re-indexing. Explaining why is a bit more complicated because it gets into the inner workings of how ModeShape stores binary values.

                     

                    As we explain in our documentation, ModeShape stores all BINARY values in a separate area, which we call the BinaryStore. Applications use the ValueFactory to create a javax.jcr.Binary value from an InputStream, and it is at this point that ModeShape places the content into the BinaryStore. It first computes the content's SHA-1 hash (which is determined entirely by the content and will be exactly the same given the same content), and then looks to see if the BinaryStore is already storing content with that SHA-1. If it is, then nothing else needs to be done. If it is not, then ModeShape stores the content in the BinaryStore, and then (asynchronously) extracts the text and stores the extracted text in the BinaryStore (right alongside the content). Again, all of this happens when the application creates the javax.jcr.Binary value.

                     

                    Now, when the application sets a property such that it uses this javax.jcr.Binary value, the property really just stores the SHA-1. The application will then save its session, at which point ModeShape records which SHA-1s were used.

                     

                    When you reindex the content, ModeShape walks the subgraph that you're reindexing and simply indexes the property values of every node in that subgraph. When it gets to a javax.jcr.Binary value, it merely asks the BinaryStore for the already extracted text, and then indexes that extracted text.

                     

                    This is why reindexing does not cause the text to be re-extracted from the content.
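
                    The flow described above can be sketched as a toy model. This is not ModeShape's BinaryStore; the class name BinaryStoreSketch and all of its methods are hypothetical, and the extraction runs synchronously here for simplicity, whereas ModeShape does it asynchronously. It only illustrates why storing the same content twice yields one key and one extraction, and why a reindex can reuse the stored text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy content-addressed store (hypothetical; NOT ModeShape's BinaryStore API).
class BinaryStoreSketch {
    private final Map<String, byte[]> contentByKey = new HashMap<>();
    private final Map<String, String> extractedTextByKey = new HashMap<>();
    private final Function<byte[], String> extractor; // stand-in for a text extractor
    private int extractionCount = 0;

    BinaryStoreSketch(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    // The key depends only on the bytes, so identical content gives an identical key.
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    // Stores content and extracts text only if this SHA-1 has not been seen before.
    String store(byte[] content) {
        String key = sha1Hex(content);
        if (!contentByKey.containsKey(key)) {
            contentByKey.put(key, content.clone());
            // Assumption: done inline here; the thread says ModeShape does this asynchronously.
            extractedTextByKey.put(key, extractor.apply(content));
            extractionCount++;
        }
        return key;
    }

    // What a reindex would ask for: the text that was extracted when the content was stored.
    String extractedText(String key) {
        return extractedTextByKey.get(key);
    }

    int extractionCount() {
        return extractionCount;
    }

    public static void main(String[] args) {
        BinaryStoreSketch store = new BinaryStoreSketch(
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        String first = store.store("same bytes".getBytes(StandardCharsets.UTF_8));
        String second = store.store("same bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println("same key: " + first.equals(second));
        System.out.println("extractions: " + store.extractionCount());
    }
}
```

                    In this model a property would hold only the returned key, and a reindex would call extractedText(key) for each key it encounters rather than running the extractor again.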

                    • 22. Re: Error while trying to setup Tika text extractor in modeshape
                      satyakishor.m

                      You answered my question. We were trying to see whether we can reindex everything so that we can search the repository in case of file/db index loss. As you said, except for text extraction from attachments, we can reindex the entire repository. I would love to see ModeShape extract the text from the actual attachments during the reindex process. But I think we are fine for now. Thanks again for your help.

                      • 23. Re: Error while trying to setup Tika text extractor in modeshape
                        rhauch

                        I would love to see ModeShape extract the text from the actual attachments during the reindex process.

                        There would be no point to doing this, as it would extract exactly the same text that it previously extracted. Again, the binary content (and thus the extracted text) for a given SHA-1 never changes; only the properties' references to the SHA-1s change when the client apps do so, and reindexing already handles these changes.

                        • 24. Re: Error while trying to setup Tika text extractor in modeshape
                          satyakishor.m

                          I think you misunderstood my previous question. Let me elaborate a bit. For example, I stored some nodes and a few attachments in the repository. At this point I am able to search for both node attributes and attachment contents. Now if I truncate the index tables and search for node attributes or attachment contents, nothing comes up. If I reindex the workspace, a search for node attributes returns the correct results, but a search for attachment contents returns nothing.

                          • 25. Re: Error while trying to setup Tika text extractor in modeshape
                            rhauch

                            Are you indexing asynchronously? If so, queries may return incomplete results until the indexing has been completed.

                             

                            If you're still having a problem, do you have (or can you create) a test case that replicates this situation?

                            • 26. Re: Error while trying to setup Tika text extractor in modeshape
                              satyakishor.m

                              Yes, I am indexing asynchronously and also waited enough time for indexing to complete before I search. I want to create a test case. Can you please guide me through the steps to add a test case in ModeShape using Arquillian?

                              • 27. Re: Error while trying to setup Tika text extractor in modeshape
                                rhauch

                                Yes, I am indexing asynchronously and also waited enough time for indexing to complete before I search.

                                 

                                The "reindexAsync" methods return a Future<Boolean>, on which you can call "get()" to block your thread until the reindexing process has completed.
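
                                As a minimal sketch of blocking on such a result: java.util.concurrent.Future exposes blocking through get(), optionally with a timeout. The example below does not call ModeShape; startFakeReindex is a hypothetical stand-in that returns a Future<Boolean> from an executor, and the class name and the 10-second timeout are assumptions made for illustration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ReindexWaitSketch {
    // Hypothetical stand-in for an asynchronous reindex call; it only sleeps briefly.
    static Future<Boolean> startFakeReindex(ExecutorService executor) {
        return executor.submit(() -> {
            Thread.sleep(50); // pretend to do indexing work
            return Boolean.TRUE;
        });
    }

    // Blocks the calling thread until the task finishes or the timeout expires.
    static boolean waitForReindex(Future<Boolean> future, long timeoutSeconds)
            throws InterruptedException, ExecutionException, TimeoutException {
        return future.get(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> future = startFakeReindex(executor);
            boolean completed = waitForReindex(future, 10);
            // Only run queries after this point, once the work has finished.
            System.out.println("reindex finished: " + completed);
        } finally {
            executor.shutdown();
        }
    }
}
```

                                Waiting on the Future instead of sleeping for a fixed time removes the guesswork about whether indexing has finished before the search runs.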

                                 

                                 

                                I want to create a test case. Can you please guide me through the steps to add a test case in ModeShape using Arquillian?

                                 

                                Our sequencer integration tests that use AS7 are here: https://github.com/ModeShape/modeshape/blob/master/integration/modeshape-jbossas-integration-tests/src/test/java/org/modeshape/test/integration/SequencersIntegrationTest.java

                                 

                                If you want to create a standalone test case, then follow what they're doing. Otherwise, clone our GitHub repo, create a branch, and simply add a method to that test class that does what you want to do. Once that's done, commit locally and create a pull-request. See ModeShape Development Workflow

                                • 28. Re: Error while trying to setup Tika text extractor in modeshape
                                  satyakishor.m

                                  Randall,

                                   

                                  I just tested the reindex feature in my application, and it actually does extract the contents from the attachments during the reindex process. For some reason it didn't work when I tested last time. Maybe I used a wrong configuration. Sorry for the confusion. Thanks for your support.

                                  • 29. Re: Error while trying to setup Tika text extractor in modeshape
                                    rhauch

                                    No worries. I'm glad you have it working now.
