27 Replies Latest reply on Oct 16, 2012 4:11 PM by pkutrhari

    Yet another full text search question

    nl

      Hi,

       

      My repository contains the file "jcr-spec.doc".

       

      Now I run the following two full-text search queries (ModeShape 2.8.2), one using the "Search" language and the other using "JCR-SQL2":

       

       

      {code}

      qmgr.createQuery("Repository Content", "Search").execute();

      qmgr.createQuery("SELECT [jcr:path] FROM [nt:base] WHERE contains([nt:base],'Repository Content')", Query.JCR_SQL2).execute();

      {code}

       

      According to the documentation, both queries should return the same results, but they do not:

       

      {noformat}

      +---+----------------------------------------------------------------------------------------------------------------------------------------------+----------------+

      | # | Location(Results)                                                                                                                            | Score(Results) |

      +---+----------------------------------------------------------------------------------------------------------------------------------------------+----------------+

      | 1 | </{}files/{}jcr-spec.doc/{http://www.jcp.org/jcr/1.0}content && [{http://www.modeshape.org/1.0}uuid = 43ae66b0-7279-459d-b65a-68d53bc88476]> | 0.18370658     |

      +---+----------------------------------------------------------------------------------------------------------------------------------------------+----------------+

      +---+----------+-------------------+----------------+

      | # | jcr:path | Location(nt:base) | Score(nt:base) |

      +---+----------+-------------------+----------------+

      +---+----------+-------------------+----------------+

      {noformat}

       

      What's wrong?

       

      Thanks,

       

      Niels

        • 1. Re: Yet another full text search question
          hchiorean

          Hi Niels,

           

          I think the documentation is misleading: the first parameter of CONTAINS should be either a property name or "all the properties" of a selector, like so: [nt:base].*

          • 2. Re: Yet another full text search question
            nl

            Hi Horia,

             

            But even if I use

             

            qmgr.createQuery("SELECT [jcr:path] FROM [nt:base] WHERE contains([nt:base].*,'Repository Content')", Query.JCR_SQL2).execute();

             

            I still get no results.

            • 3. Re: Yet another full text search question
              hchiorean

              Is the string "Repository Content" present in a node property in the repository? (That is, can you retrieve that property via node.getProperty and confirm that it contains this text?)

              • 4. Re: Yet another full text search question
                nl

                The text is contained in the document content, which is stored in the jcr:data property of the file's jcr:content node:

                 

                 

                {noformat}

                files jcr:primaryType=nt:folder jcr:mixinTypes=[mode:publishArea]

                   - jcr:created=2012-07-18T10:59:35.779+02:00

                   - jcr:createdBy="nl"

                   jcr-spec.doc jcr:primaryType=nt:file jcr:mixinTypes=[mix:referenceable,mix:title] jcr:uuid=538a666f-ea51-4277-ad28-8a221446605d

                     - jcr:created=2012-07-18T10:59:35.669+02:00

                     - jcr:createdBy="nl"

                     jcr:content jcr:primaryType=nt:resource jcr:mixinTypes=[mix:language]     

                       - jcr:data=binary (2,55MB, SHA1=94f7889b5fe96cbba493630cbe7cb678a5b7ccc1)

                       - jcr:encoding="windows-1252"

                       - jcr:language="en"

                       - jcr:lastModified=2012-07-18T10:59:35.779+02:00

                       - jcr:lastModifiedBy="nl"

                       - jcr:mimeType="application/msword"

                {noformat}

                 

                • 5. Re: Yet another full text search question
                  hchiorean

                  To run a full-text search over the content of the .doc, you need a text extractor configured that extracts the document content and makes it available to the Lucene indexer. Do you have such an extractor configured?

                   

                  e.g.

                   

                  <mode:textExtractors>
                      <mode:textExtractor jcr:name="Tika Extractor">
                          <mode:description>Text extractor that uses the Tika library of parsers</mode:description>
                          <mode:classname>org.modeshape.extractor.tika.TikaTextExtractor</mode:classname>
                      </mode:textExtractor>
                  </mode:textExtractors>

                  • 6. Re: Yet another full text search question
                    nl

                    I have the text extractor configured (see attached); otherwise I'd wonder why the first query (using the "Search" language) matches the document.

                    • 7. Re: Yet another full text search question
                      hchiorean

                      What if you change the node type in the query from [nt:base] to [nt:resource]? (IIRC, nt:resource is not an nt:base)

                      • 8. Re: Yet another full text search question
                        nl

                        {code}qmgr.createQuery("SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].*,'Repository Content')", Query.JCR_SQL2).execute();{code}

                         

                        This also returns no results.

                        • 9. Re: Yet another full text search question
                          hchiorean

                          Well, this seems like a bug to me. I'll wait to see what Randall says about this, and if it is a bug, we'll probably open a JIRA issue.

                          • 10. Re: Yet another full text search question
                            nl

                            Further question:

                             

                            While

                             

                            {code}qmgr.createQuery("SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].*,'Repository Content')", Query.JCR_SQL2).execute();{code}

                             

                            searches all properties of nt:resource for the given text, is it possible to restrict the search to a file's content with the following query

                             

                            {code}qmgr.createQuery("SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].[jcr:data],'Repository Content')", Query.JCR_SQL2).execute();{code}

                             

                            ?

                             

                            Thanks, Niels

                            • 11. Re: Yet another full text search question
                              hchiorean

                              In general (meaning for properties with text values), yes. However, AFAIK, values extracted by a text extractor are not indexed against a specific property of the node, so you need the * form ([nt:resource].*) in your search.
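
For illustration, here is a minimal sketch of building the "all properties" form of the query as a string. The class and method names (FullTextQueries, containsAnyProperty, quote) are hypothetical helpers, not part of ModeShape or the JCR API, and the sketch assumes that a single quote inside a JCR-SQL2 string literal is escaped by doubling it.

```java
/** Hypothetical helper for illustration; not part of ModeShape or the JCR API. */
final class FullTextQueries {
    private FullTextQueries() {}

    /** Wraps text in single quotes, doubling embedded single quotes (assumed JCR-SQL2 literal escaping). */
    static String quote(String text) {
        return "'" + text.replace("'", "''") + "'";
    }

    /** Builds a query that searches all indexed properties of the given node type via the ".*" form. */
    static String containsAnyProperty(String nodeType, String searchExpression) {
        String selector = "[" + nodeType + "]";
        return "SELECT [jcr:path] FROM " + selector
                + " WHERE CONTAINS(" + selector + ".*, " + quote(searchExpression) + ")";
    }

    public static void main(String[] args) {
        // Prints the same shape of query used earlier in this thread.
        System.out.println(containsAnyProperty("nt:resource", "Repository Content"));
    }
}
```

The resulting string would then be passed to qmgr.createQuery(..., Query.JCR_SQL2) exactly as in the queries above.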

                              • 12. Re: Yet another full text search question
                                nl

                                I have a bad feeling that you were right in assuming that no text extractor is running (or at least not before my query fires).

                                 

                                I know that the sequencer service runs asynchronously, and I suppose the text extractor does too, doesn't it?

                                • 13. Re: Yet another full text search question
                                  hchiorean

                                  I have the same feeling. Take a look at https://issues.jboss.org/browse/MODE-1560 (I just opened it).

                                   

                                  If you look at your runtime classpath, is there, by any chance, a version of apache-poi older than 3.8? If so, no text is extracted and the error is silently ignored (if you can debug, have a look at org.modeshape.search.lucene.LuceneSearchSession#702).

                                  To answer your question: the text is extracted during the indexing process, which is async.
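
Because indexing is asynchronous, a query fired right after saving a file may run before the extracted text is in the index. A minimal polling sketch, assuming the caller supplies a function that runs the query and returns the number of rows (the class AwaitIndexing and the method awaitHits are hypothetical names, and the supplier in main is a stand-in for a real query):

```java
import java.time.Duration;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not part of ModeShape. */
final class AwaitIndexing {
    private AwaitIndexing() {}

    /**
     * Calls hitCount repeatedly until it reports at least one hit or the timeout elapses.
     * Returns true if a hit was seen, false on timeout or if the thread is interrupted.
     */
    static boolean awaitHits(IntSupplier hitCount, Duration timeout, Duration pause) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (hitCount.getAsInt() > 0) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for "execute the query and count the rows": reports a hit on the third attempt.
        IntSupplier fakeQuery = () -> ++calls[0] >= 3 ? 1 : 0;
        boolean found = awaitHits(fakeQuery, Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("found=" + found + " after " + calls[0] + " attempts");
    }
}
```

Note that polling only helps when the text is eventually indexed; it does not help if extraction fails outright.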

                                  • 14. Re: Yet another full text search question
                                    nl

                                    Hi,

                                     

                                    I'm using modeshape-extractor-tika-2.8.2.Final-jar-with-dependencies.jar, which includes apache-poi 3.8-beta4.

                                    But thanks to your debugging hint I've found the error.

                                     

                                    TikaTextExtractor throws an exception: "org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available)."

                                     

                                    ModeShape records this exception in the TextExtractorContext but does not use or report it in any way.

                                     

                                    So what I need is to increase the write limit of the BodyContentHandler. But it is hardcoded to the default size...

                                     

                                    What I do now is use my "own" TextExtractor (a copy of yours, but with an increased write limit) and *surprise, surprise* it is working!!!

                                     

                                    This is of course not very satisfying, and I hope you'll find a way to make this limit configurable or to increase it by default.
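
To make the wish concrete: Tika's BodyContentHandler has a constructor that takes an int write limit, where -1 disables the limit, so an extractor could pass a configured value instead of relying on the default of 100000 characters. The following is a JDK-only sketch of that behaviour, not Tika's or ModeShape's actual code. BoundedTextCollector and its methods are hypothetical names, and treating a negative limit as "no limit" mirrors Tika's -1 by assumption. On overflow it keeps the text collected so far and reports the problem instead of silently dropping everything.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a write-limited content handler; not Tika or ModeShape code. */
final class BoundedTextCollector {

    /** Signals that more characters were written than the configured limit allows. */
    static final class WriteLimitReachedException extends RuntimeException {
        WriteLimitReachedException(int limit) {
            super("write limit of " + limit + " characters reached; text up to the limit is available");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // negative means "no limit" (assumption mirroring Tika's -1)

    BoundedTextCollector(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    /** Appends a chunk; on overflow keeps what still fits and then throws. */
    void append(CharSequence chunk) {
        if (writeLimit < 0 || text.length() + chunk.length() <= writeLimit) {
            text.append(chunk);
            return;
        }
        int room = writeLimit - text.length();
        text.append(chunk, 0, room);
        throw new WriteLimitReachedException(writeLimit);
    }

    /** Collects all chunks; if the limit is hit, warns and returns the truncated text. */
    static String extract(Iterable<String> chunks, int writeLimit) {
        BoundedTextCollector collector = new BoundedTextCollector(writeLimit);
        try {
            for (String chunk : chunks) {
                collector.append(chunk);
            }
        } catch (WriteLimitReachedException e) {
            // Report the problem rather than swallowing it, and still index the partial text.
            System.err.println("warning: " + e.getMessage());
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(Arrays.asList("Repository ", "Content"), 12)); // truncated
        System.out.println(extract(Arrays.asList("Repository ", "Content"), -1)); // unlimited
    }
}
```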
