27 Replies. Latest reply on Oct 16, 2012 4:11 PM by pkutrhari

Yet another full text search question

nl Jul 18, 2012 3:35 AM

Hi,

My repository contains the file "jcr-spec.doc".

I run the following two full-text search queries (ModeShape 2.8.2), one in the "Search" language and the other in "JCR-SQL2":

{code}
qmgr.createQuery("Repository Content", "Search").execute();
qmgr.createQuery("SELECT [jcr:path] FROM [nt:base] WHERE contains([nt:base],'Repository Content')", Query.JCR_SQL2).execute();
{code}

According to the documentation, both queries should return the same results, but they do not:

{noformat}
+---+----------------------------------------------------------------------------------------------------------------------------------------------+----------------+
| # | Location(Results)                                                                                                                            | Score(Results) |
+---+----------------------------------------------------------------------------------------------------------------------------------------------+----------------+
| 1 | </{}files/{}jcr-spec.doc/{http://www.jcp.org/jcr/1.0}content && [{http://www.modeshape.org/1.0}uuid = 43ae66b0-7279-459d-b65a-68d53bc88476]> | 0.18370658     |
+---+----------------------------------------------------------------------------------------------------------------------------------------------+----------------+
+---+----------+-------------------+----------------+
| # | jcr:path | Location(nt:base) | Score(nt:base) |
+---+----------+-------------------+----------------+
+---+----------+-------------------+----------------+
{noformat}

What's wrong?

Thanks,

Niels

1. Re: Yet another full text search question

hchiorean Jul 18, 2012 4:01 AM (in response to nl)

Hi Niels,

I think the documentation is misleading: the first parameter of CONTAINS should be a property name or "all the properties", like so: [nt:base].*
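To make the two operand forms concrete, here is a minimal sketch of a helper that builds the JCR-SQL2 string for either case. `FullTextQueries` and its `contains` method are hypothetical names for illustration, not part of ModeShape or the JCR API. Doubling single quotes follows my reading of JCR-SQL2 string literals and should be treated as an assumption.

```java
public final class FullTextQueries {

    private FullTextQueries() {}

    /**
     * Builds a JCR-SQL2 full-text query against a single selector.
     * Pass null as propertyName to search all properties (selector.*).
     */
    public static String contains(String nodeType, String propertyName, String searchExpression) {
        String selector = "[" + nodeType + "]";
        String operand = (propertyName == null)
                ? selector + ".*"
                : selector + ".[" + propertyName + "]";
        // Assumption: a single quote inside a JCR-SQL2 string literal is escaped by doubling it.
        String literal = "'" + searchExpression.replace("'", "''") + "'";
        return "SELECT [jcr:path] FROM " + selector
                + " WHERE CONTAINS(" + operand + ", " + literal + ")";
    }

    public static void main(String[] args) {
        // All properties of the selector.
        System.out.println(contains("nt:base", null, "Repository Content"));
        // A single named property.
        System.out.println(contains("nt:resource", "jcr:data", "Repository Content"));
    }
}
```

The resulting string would be passed to `qmgr.createQuery(..., Query.JCR_SQL2)` exactly as in the original post.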
2. Re: Yet another full text search question

nl Jul 18, 2012 4:48 AM (in response to hchiorean)

Hi Horia,

but even if I use

qmgr.createQuery("SELECT [jcr:path] FROM [nt:base] WHERE contains([nt:base].*,'Repository Content')", Query.JCR_SQL2).execute();

I still get no results.
3. Re: Yet another full text search question

hchiorean Jul 18, 2012 4:55 AM (in response to nl)

Is the string "Repository Content" present in a node property in the repository? (That is, can you retrieve that property via node.getProperty and confirm that it contains this text?)

4. Re: Yet another full text search question

nl Jul 18, 2012 5:03 AM (in response to hchiorean)

The text is contained in the document content, which is stored in the jcr:data property of a file node:

 
{noformat}
files jcr:primaryType=nt:folder jcr:mixinTypes=[mode:publishArea]
   - jcr:created=2012-07-18T10:59:35.779+02:00
   - jcr:createdBy="nl"
   jcr-spec.doc jcr:primaryType=nt:file jcr:mixinTypes=[mix:referenceable,mix:title] jcr:uuid=538a666f-ea51-4277-ad28-8a221446605d
     - jcr:created=2012-07-18T10:59:35.669+02:00
     - jcr:createdBy="nl"
     jcr:content jcr:primaryType=nt:resource jcr:mixinTypes=[mix:language]      
       - jcr:data=binary (2,55MB, SHA1=94f7889b5fe96cbba493630cbe7cb678a5b7ccc1)
       - jcr:encoding="windows-1252"
       - jcr:language="en"
       - jcr:lastModified=2012-07-18T10:59:35.779+02:00
       - jcr:lastModifiedBy="nl"
       - jcr:mimeType="application/msword"
{noformat}

5. Re: Yet another full text search question

hchiorean Jul 18, 2012 5:09 AM (in response to nl)

To be able to "full text search" the content of the .doc, you need to have a text extractor configured that extracts the document content and makes it available to the Lucene indexer. Do you have such an extractor configured?

e.g.

<mode:textExtractors>
    <mode:textExtractor jcr:name="Tika Extractor">
        <mode:description>Text extractor that uses the Tika library of parsers</mode:description>
        <mode:classname>org.modeshape.extractor.tika.TikaTextExtractor</mode:classname>
    </mode:textExtractor>
</mode:textExtractors>
6. Re: Yet another full text search question

nl Jul 18, 2012 5:16 AM (in response to hchiorean)
I have the text extractor configured (see attached); otherwise I'd wonder why the first query (in the "Search" language) matches the document.

modeshape-config.xml 8.0 KB
7. Re: Yet another full text search question

hchiorean Jul 18, 2012 5:22 AM (in response to nl)

What if you change the node type in the query from [nt:base] to [nt:resource]? (IIRC, nt:resource is not an nt:base.)
8. Re: Yet another full text search question

nl Jul 18, 2012 5:29 AM (in response to hchiorean)
{code}qmgr.createQuery("SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].*,'Repository Content')", Query.JCR_SQL2).execute();{code}

also delivers no result.
9. Re: Yet another full text search question

hchiorean Jul 18, 2012 5:54 AM (in response to nl)

Well, this seems like a bug to me. I'll wait to see what Randall says about this, and if it is a bug we'll probably open a JIRA issue.
10. Re: Yet another full text search question

nl Jul 18, 2012 6:02 AM (in response to hchiorean)
Further question:

While

{code}qmgr.createQuery("SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].*,'Repository Content')", Query.JCR_SQL2).execute();{code}

searches all properties of nt:resource for the given text, is it possible to restrict the search to a file's content with the following query

{code}qmgr.createQuery("SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].[jcr:data],'Repository Content')", Query.JCR_SQL2).execute();{code}

?

Thanks, Niels
11. Re: Yet another full text search question

hchiorean Jul 18, 2012 6:20 AM (in response to nl)

In general (that is, for properties with text values), yes. However, AFAIK, text produced by a text extractor is not indexed under a specific property of the node, so you need the * wildcard in your search.
12. Re: Yet another full text search question

nl Jul 18, 2012 6:47 AM (in response to hchiorean)

I have a bad feeling that you were right in assuming that no text extractor is running (or at least not before my query fires).

I know that the sequencer service runs asynchronously, and I suppose the text extractor does too, doesn't it?
13. Re: Yet another full text search question

hchiorean Jul 18, 2012 6:55 AM (in response to nl)

I have the same feeling. Take a look at https://issues.jboss.org/browse/MODE-1560 (I just opened it).

If you look inside your runtime classpath, is there, by any chance, a version of apache-poi older than 3.8? If so, no text is extracted and the error is silently ignored (if you can debug, have a look at this: org.modeshape.search.lucene.LuceneSearchSession#702).
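One way to check what is actually on the runtime classpath is to ask the JVM where a given class was loaded from. The sketch below uses only standard Java reflection. `ClasspathProbe` and `locate` are hypothetical names, and the result depends on which class loader performs the lookup, so in an application server it should be run from code deployed alongside ModeShape.

```java
import java.security.CodeSource;

public final class ClasspathProbe {

    private ClasspathProbe() {}

    /** Returns where a class was loaded from, or a short note if that cannot be determined. */
    public static String locate(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class<?> type = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader (JDK classes) have no code source.
            return source == null
                    ? "(no code source, probably a JDK class)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the extractor configuration earlier in this thread.
        System.out.println(locate("org.modeshape.extractor.tika.TikaTextExtractor"));
    }
}
```

The printed location is normally a jar path, and the jar name often carries the version. With a "jar-with-dependencies" bundle it only shows the bundle, so the bundled library versions have to be checked inside that jar.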
To answer your question: the text is extracted during the indexing process, which is async.
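Because indexing is asynchronous, a query fired immediately after saving can legitimately come back empty. A minimal sketch of polling until rows appear follows. `AwaitIndexing` and `awaitRows` are hypothetical names, and the `IntSupplier` stands in for whatever code executes the query and counts its rows.

```java
import java.util.function.IntSupplier;

public final class AwaitIndexing {

    private AwaitIndexing() {}

    /**
     * Re-runs a row-counting query until it reports at least one row or the timeout elapses.
     * Returns the last observed row count (0 if nothing showed up in time).
     */
    public static int awaitRows(IntSupplier rowCount, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        int rows = rowCount.getAsInt();
        while (rows == 0 && deadline - System.nanoTime() > 0) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
            rows = rowCount.getAsInt();
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for a real query: pretends the index catches up on the third attempt.
        int[] attempts = {0};
        int rows = awaitRows(() -> ++attempts[0] >= 3 ? 1 : 0, 2_000, 10);
        System.out.println("rows=" + rows + " after " + attempts[0] + " attempts");
    }
}
```

Polling only helps when indexing is merely late. If extraction fails outright, the query stays empty no matter how long you wait.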
14. Re: Yet another full text search question

nl Jul 18, 2012 7:57 AM (in response to hchiorean)

Hi,

I'm using modeshape-extractor-tika-2.8.2.Final-jar-with-dependencies.jar and this includes apache-poi 3.8-beta4.
But thanks to your debugging hint I've found the error.

TikaTextExtractor throws an exception: "org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available)."

ModeShape is recording this exception in the TextExtractorContext but not using/reporting it in any way.

So what I need is to increase the write limit of the BodyContentHandler, but it is hardcoded to the default size...

What I do now is use my "own" TextExtractor (a copy of yours, but with an increased write limit) and, *surprise, surprise*, it works!

This is of course not very satisfying, and I hope you'll find a way to make this limit configurable or to increase it by default.
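The failure mode described here can be illustrated without Tika or ModeShape. The sketch below is a plain-Java stand-in, not code from either project: it collects text up to a configurable limit and, when the limit is hit, returns both the partial text and the error rather than discarding them. All names in it are hypothetical. For the real fix, Tika's `BodyContentHandler` has, to my knowledge, a constructor that takes a write limit, with -1 disabling the limit.

```java
import java.util.Arrays;
import java.util.List;

public final class WriteLimitDemo {

    /** Thrown when more characters arrive than the limit allows. */
    public static final class LimitReachedException extends RuntimeException {
        public LimitReachedException(int limit) {
            super("write limit of " + limit + " characters reached");
        }
    }

    /** Outcome of an extraction: the text collected so far plus the error, if any. */
    public static final class Extraction {
        public final String text;
        public final Exception error;

        Extraction(String text, Exception error) {
            this.text = text;
            this.error = error;
        }
    }

    /** Collects chunks up to the limit; a negative limit means "no limit". */
    public static Extraction extract(Iterable<String> chunks, int limit) {
        StringBuilder out = new StringBuilder();
        try {
            for (String chunk : chunks) {
                if (limit >= 0 && out.length() + chunk.length() > limit) {
                    // Keep the part that still fits, then signal that the limit was hit.
                    out.append(chunk, 0, limit - out.length());
                    throw new LimitReachedException(limit);
                }
                out.append(chunk);
            }
            return new Extraction(out.toString(), null);
        } catch (LimitReachedException e) {
            // Surface the partial text together with the error instead of dropping both.
            return new Extraction(out.toString(), e);
        }
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("JCR spec. ", "Repository Content");
        Extraction capped = extract(doc, 12);
        System.out.println(capped.text.length() + " chars, error=" + capped.error);
        Extraction full = extract(doc, -1);
        System.out.println(full.text.length() + " chars, error=" + full.error);
    }
}
```

An indexer built this way could still index the truncated text and log the error, so a document over the limit would be partially searchable instead of silently missing.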
