5 Replies Latest reply on Mar 15, 2011 7:38 PM by rhauch

Question on Lucene indexing

jonathandfields Mar 15, 2011 6:12 PM

Hi,

I've been experimenting with the file system connector and searching (free text search) and was pleasantly surprised when a file (.txt extension) that I had created on the file system outside of modeshape had been indexed in Lucene (it was found in a search).

This may be a FAQ, or may be in the documentation (please point me to it if so), but what is the trigger for indexing in such a case? Is there a background thread scanning the file system for new or changed files? Is that in any way configurable?

Or am I missing the big picture that's a part of sequencing?

Thanks!

1. Re: Question on Lucene indexing

rhauch Mar 15, 2011 7:39 PM (in response to jonathandfields)
I just checked the Reference Guide, and the FileSystem Connector chapter doesn't really explain that the connector generally reflects the current file system content, even when files and/or directories are added under the connector's primary directory. That's why ModeShape recognized your newly added file. At the present time, however, the connector does not publish events for these discoveries.

Now, ModeShape also maintains the Lucene indexes for query/search purposes, and there are a few situations in which the component that manages Lucene re-indexes some or all of the content:
- upon startup. This always happens, though we'll be changing this to make it smarter.
- when changes are made to part of the repository (the Lucene component watches the events), but the events are too complex and/or don't contain all the information necessary for indexing. Other events do contain everything necessary for indexing (e.g., creation of nodes, addition of properties, etc.) and don't require this.
- when the location of the Lucene indexes is changed (though a restart is required for this to take effect).

So I suspect that one of these conditions occurred, causing ModeShape to reindex at least part of your content on the file system.

At the present time, the File System connector does not publish events when it detects that the content has changed. This is primarily because there's no easy way in Java 1.6 to watch the file system for changes. Once we start using Java 7, however, we should be able to implement this so that when files/directories are added, changed, or removed under the connector's primary directory, the connector will generate the appropriate event and the Lucene indexes will be updated immediately.
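For anyone curious what the Java 7 approach could look like, here is a minimal sketch using the standard `java.nio.file.WatchService` API. It is not connector code: the class name `DirectoryWatchSketch` and its two helper methods are made up for illustration, and it watches only a single directory, not its subdirectories.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of ModeShape.
final class DirectoryWatchSketch {

    /** Registers one directory (non-recursively) for create/modify/delete notifications. */
    public static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    /**
     * Waits up to timeoutMillis for one batch of events and returns them as
     * "KIND:relativePath" strings. Returns an empty list if nothing happened in time.
     */
    public static List<String> nextChanges(WatchService watcher, long timeoutMillis)
            throws InterruptedException {
        List<String> changes = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return changes;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                continue; // events were lost; a real connector would need to rescan here
            }
            // context() is the path of the changed entry, relative to the watched directory
            changes.add(event.kind().name() + ":" + event.context());
        }
        key.reset(); // re-arm the key so later events are delivered
        return changes;
    }
}
```

A connector built along these lines would call `nextChanges` in a background thread and translate each entry into a repository event, which is what would let the indexes be updated without a full re-index.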
2. Question on Lucene indexing

jonathandfields Mar 15, 2011 7:05 PM (in response to jonathandfields)

I stand corrected. I misread my search results. The match was not on the jcr:data property of the jcr:content node of the nt:file, but rather on the node name of the parent nt:folder. It appears that a full-text search over a file system connector matches on the node names, but not on the property (file) contents. Is that correct? Where can I find out how queries behave over the file system connector?

Thanks!
3. Question on Lucene indexing

jonathandfields Mar 15, 2011 7:13 PM (in response to jonathandfields)

Looks like our responses crossed paths... Thanks for the info.

It does look like what I originally said is not true - the hit was not on the file contents, but on the node name of the parent folder.

I did change from the in memory Lucene index storage to file system storage after the file was created. Perhaps that triggered some re-indexing.

I'll dig a bit deeper.

Thanks.
4. Question on Lucene indexing

rhauch Mar 15, 2011 7:37 PM (in response to jonathandfields)

I presume your query used a full-text search? Such criteria are evaluated against all properties whose property definition does not preclude full-text search (via the "nofulltext" attribute, mentioned in Sections 6.7.19 and 25.2.3 of the JCR 2 specification). Note that property definitions by default do not have this attribute and are thus searchable. Basically, all non-BINARY property values are converted to STRING values and indexed for use in full-text searches.

Notice that I said "non-BINARY" properties. With versions 2.4 and earlier, ModeShape had no mechanism to convert a BINARY value into indexable characters. And since "jcr:data" is a binary property, ModeShape 2.4 (or earlier) never indexed the content of files for full-text searching.
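To make that rule concrete, here is a toy model of the selection logic. It is only a sketch: `FullTextEligibilitySketch`, `Prop`, and the `noFullText` flag are hypothetical names invented for this example, not ModeShape or JCR classes, and the BINARY value is represented by a placeholder string.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; none of these types exist in ModeShape or the JCR API.
final class FullTextEligibilitySketch {

    enum Type { STRING, LONG, BINARY }

    /** A simplified property: name, type, string form of the value, and a "nofulltext" flag. */
    static final class Prop {
        final String name;
        final Type type;
        final String value;
        final boolean noFullText;

        Prop(String name, Type type, String value, boolean noFullText) {
            this.name = name;
            this.type = type;
            this.value = value;
            this.noFullText = noFullText;
        }
    }

    /** Returns the values that would be indexed for full-text search under the rule above. */
    public static List<String> indexableText(List<Prop> props) {
        List<String> indexed = new ArrayList<>();
        for (Prop p : props) {
            if (p.type == Type.BINARY) {
                continue; // 2.4 and earlier: no way to turn BINARY into indexable characters
            }
            if (p.noFullText) {
                continue; // the property definition opts out of full-text search
            }
            indexed.add(p.value); // every other value is indexed in its STRING form
        }
        return indexed;
    }
}
```

Under this model a file's "jcr:data" contributes nothing to the index while its sibling STRING properties do, which matches the behavior you saw.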

However, this will be changing in ModeShape 2.5 with the introduction of Text Extractors. Each text extractor describes the MIME types that it supports, and ModeShape determines the MIME type of BINARY property values and uses the appropriate extractor to get tokens that can be indexed for full-text search. This means that ModeShape 2.5 will also index the content of files stored in the repository, as long as there is an extractor for that file's MIME type.
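The idea of picking an extractor by MIME type can be sketched roughly as follows. This is an illustration of the dispatch pattern only: the `Extractor` interface, `PlainTextExtractor`, and `tokensFor` are assumptions made up for this example and are not the actual ModeShape 2.5 extractor API (see the Reference Guide for that). The tokenizing is also deliberately naive compared to what Lucene's analyzers do.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only; not ModeShape's real text extractor classes.
final class ExtractorSketch {

    /** Hypothetical contract: say which MIME types you handle, then turn bytes into text. */
    interface Extractor {
        boolean supportsMimeType(String mimeType);
        String extract(byte[] content);
    }

    /** Example extractor for text/* content, assumed here to be UTF-8 encoded. */
    static final class PlainTextExtractor implements Extractor {
        public boolean supportsMimeType(String mimeType) {
            return mimeType != null && mimeType.startsWith("text/");
        }
        public String extract(byte[] content) {
            return new String(content, StandardCharsets.UTF_8);
        }
    }

    private final List<Extractor> extractors = new ArrayList<>();

    public void register(Extractor extractor) {
        extractors.add(extractor);
    }

    /**
     * Finds the first registered extractor that supports the MIME type and returns
     * lower-cased word tokens. Returns an empty list when no extractor matches,
     * meaning the binary value is simply not indexed.
     */
    public List<String> tokensFor(String mimeType, byte[] content) {
        for (Extractor extractor : extractors) {
            if (extractor.supportsMimeType(mimeType)) {
                List<String> tokens = new ArrayList<>();
                String text = extractor.extract(content).toLowerCase(Locale.ROOT);
                for (String token : text.split("\\W+")) {
                    if (!token.isEmpty()) {
                        tokens.add(token);
                    }
                }
                return tokens;
            }
        }
        return new ArrayList<>();
    }
}
```

The point of the design is that supporting a new file format means registering one more extractor; content whose MIME type has no extractor is skipped, as described above.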

The text extractor feature is already committed into the 'master' branch, along with some updates to the Reference Guide that explain how extractors work, how to configure them (they are not enabled by default), which extractors are available out of the box and via Apache Tika (e.g., for MS Office files, PDFs, text files, Teiid VDBs, etc.), and how to implement extractors for your own file formats. It will be included in the 2.5.0.Beta1 release (hopefully this week), and we'd love to have feedback about how it works! If you're feeling adventurous, get the latest from 'master' and take it for a spin. (Note that 'master' now requires Maven 3 and builds significantly faster.) You can build 'master' to produce all the artifacts (including documentation) with:

mvn -Passembly clean install

Let us know if you try this and have problems.
5. Question on Lucene indexing

rhauch Mar 15, 2011 7:38 PM (in response to jonathandfields)

I did change from the in memory Lucene index storage to file system storage after the file was created. Perhaps that triggered some re-indexing.
Ah, yes. That's another condition that causes re-indexing. I'll update my response above.
