25 Replies. Latest reply: Feb 24, 2012 10:56 AM by Sanne Grinovero

Expiration and Lucene Directory

Horacio Vico Newbie

Hi!

 

Yesterday, in another thread, we discussed eviction when using Infinispan as a Lucene Directory (I had not configured eviction and got out-of-memory errors in my Lucene app).

 

Now I am asking about expiration, because the row count in my cachestore table keeps growing (I assume memory is now bounded by the eviction process).

 

Could I configure expiration for an Infinispan cache that is working as a Lucene directory, or is this dangerous (possible index loss)? I was thinking of configuring it with a long period, such as "30 days".

 

One more small question: is it possible to retrieve the size of an Infinispan cache in kilobytes or a similar unit? I use the size method that the API provides, but it returns the number of keys in the cache, not its size in bytes.

 

Thanks,

  • 1. Re: Expiration and Lucene Directory
    Galder Zamarreño Master

    Re: cache size

     

    This is not currently doable. Calculating sizes of objects is not an easy task in the JVM.
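
For a rough sense of scale, one workaround is to sum the serialized size of the entries. The sketch below is not an Infinispan API: `SerializedSizeEstimator` and its methods are hypothetical names, it relies on plain Java serialization, and the result approximates serialized bytes rather than JVM heap usage. It accepts any `java.util.Map` whose keys and values are serializable; whether iterating a real cache this way is practical depends on its size and configuration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: estimates how many bytes a map's entries occupy once
// serialized. This is Java-serialized size, not JVM heap usage.
class SerializedSizeEstimator {

    // Serializes one object and returns the number of bytes produced.
    static long serializedSize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Sums the serialized size of every key and value in the map.
    static long estimate(Map<?, ?> entries) {
        long total = 0;
        for (Map.Entry<?, ?> entry : entries.entrySet()) {
            total += serializedSize(entry.getKey()) + serializedSize(entry.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample data standing in for cache entries (assumption, not real index chunks).
        Map<String, byte[]> sample = new LinkedHashMap<>();
        sample.put("chunk-0", new byte[65000]);
        sample.put("chunk-1", new byte[1000]);
        System.out.println("Estimated serialized bytes: " + estimate(sample));
    }
}
```

Each object is serialized separately, so the stream header is counted once per key and once per value; treat the total as an order-of-magnitude figure.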

  • 2. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    Hi,

     

    I am having trouble with my Lucene index tables: these tables keep growing indefinitely. I configured eviction, so I no longer get out-of-memory errors, but what about expiration? My table grew from 400 MB to 4 GB in a couple of weeks.

     

    Thanks,

  • 3. Re: Expiration and Lucene Directory
    Sanne Grinovero Master

    Hi Horacio,

    Expiration sounds dangerous. Even if you configured a week for expiry, index segments or chunks that were created more than a week ago would be removed, even if they are still a required part of the index. I wouldn't recommend that unless you have a very specific index usage for which you know it's not going to be a problem (like rebuilding the index every night).

     

    A 4 GB index is not unusual, so the real question is whether this size should be expected given your usage. Did you compare the index size to that of the same application using a filesystem index?

     

    We do have some functional tests to guard against information leaks in the lucene-directory module, but maybe they don't cover your use case well enough. Would you be able to create a test case close to your use case? Please take a look at the sources for examples of tests, and feel free to ask for help. Even a draft of a test would be great.

  • 4. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    Hi Sanne,

     

    Based on your experience, is it normal for an index to grow to ten times its size in an application where new documents are created at a really slow rate? My application has 40,000 documents, and users generate fewer than 100 new documents per week. So why did the cachestore table grow from 400 MB in a freshly reindexed state to 4 GB after a couple of weeks of mainly "read-only" usage? Is this normal Lucene behaviour?

     

    I did not have this kind of trouble using filesystem-based indexes (without Infinispan). I tested this, but would like to know your opinion about these numbers.

  • 5. Re: Expiration and Lucene Directory
    Sanne Grinovero Master

    No, that doesn't look normal. Still, Lucene might need to duplicate some segments while it is rewriting them, and if you are not optimizing the index it might be quite lazy in compacting.

     

    So it is normal to need at least twice the average index size in free space for intermediate work, but 10x seems very unlikely indeed.

     

    Are you applying any IndexWriter tuning options?
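
As a quick sanity check of the numbers quoted in this thread, the arithmetic can be written out. This is only a sketch: `IndexGrowthCheck` is a hypothetical name, "4 Gigabytes" is taken as 4000 MB, and the 2.0 allowance simply models the "twice the index size" rule of thumb mentioned above.

```java
// Hypothetical helper: compares observed store growth against an allowance.
class IndexGrowthCheck {

    // Ratio between the current store size and the size right after a fresh reindex.
    static double growthFactor(double baselineMb, double currentMb) {
        return currentMb / baselineMb;
    }

    // True when the growth factor is strictly greater than the allowance.
    static boolean exceedsAllowance(double baselineMb, double currentMb, double allowance) {
        return growthFactor(baselineMb, currentMb) > allowance;
    }

    public static void main(String[] args) {
        double baselineMb = 400;  // size after a fresh reindex, as reported in the thread
        double currentMb = 4000;  // "4 Gigabytes" taken as 4000 MB (assumption)
        System.out.printf("growth factor: %.1f%n", growthFactor(baselineMb, currentMb));
        System.out.println("exceeds 2x allowance: "
                + exceedsAllowance(baselineMb, currentMb, 2.0));
    }
}
```

With the reported figures the factor is 10, well beyond the 2x that intermediate segment rewriting would explain.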

  • 6. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    A good workaround would be to reindex from scratch at night.

     

    I tried that approach, but after clearing the full-text indexes via Hibernate Search's API (the FullTextSession purgeAll method), no rows are deleted from my cachestore tables. It seems the cache keeps the old index data and adds the new data on top of it (so the size problem gets worse). The only way I managed to get a fresh rebuild is by following these steps:

     

    1) Shutdown my cluster

    2) Drop or truncate the cachestore tables via SQL

    3) Start a cluster node (the tables are created at startup by Infinispan)

    4) Rebuild indexes

    5) Start the other nodes

     

    That process is really inconvenient, as it requires a full cluster shutdown. So maybe I am doing something wrong. Why isn't the cachestore cleaned when I purge all my entities? I noticed the same behaviour when I optimize my indexes: I do not see any reduction in the cachestore size.

     

    Thanks,

  • 7. Re: Expiration and Lucene Directory
    Sanne Grinovero Master

    Hi Horacio,

    Could you please post both your Hibernate Search and Infinispan configuration files?

    I need to reproduce your issue.

  • 8. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    Backend node (where new documents are generated and saved):

     

    Backend node, infinispan.xml:

     

    <?xml version="1.0" encoding="UTF-8"?>

    <infinispan

        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

        xsi:schemaLocation="urn:infinispan:config:4.2 http://www.infinispan.org/schemas/infinispan-config-4.2.xsd"

        xmlns="urn:infinispan:config:4.2">

      <global>

            <globalJmxStatistics enabled="false" cacheManagerName="HibernateSearch" allowDuplicateDomains="true" />

            <transport clusterName="HibernateSearch-Infinispan-cluster"  distributedSyncTimeout="50000">

                <properties>

                    <property name="configurationFile" value="jgroups3.xml"/>

                </properties>

            </transport>

            <shutdown  hookBehavior="DONT_REGISTER" />

      </global>

    <default>

            <locking lockAcquisitionTimeout="20000" writeSkewCheck="false"  concurrencyLevel="5000" useLockStriping="false" />

            <invocationBatching enabled="true" />

            <jmxStatistics enabled="false" />

            <eviction maxEntries="-1" strategy="NONE" />

            <expiration maxIdle="-1" />

                        <clustering mode="replication">

                <stateRetrieval timeout="60000"  logFlushTimeout="65000" fetchInMemoryState="true" alwaysProvideInMemoryState="true" />

                <sync replTimeout="50000" />

                <l1 enabled="false" />

            </clustering>

    </default>

      <namedCache name="LuceneIndexesLocking">

            <clustering mode="replication">

                <stateRetrieval fetchInMemoryState="true" logFlushTimeout="300000" />

                <sync replTimeout="500000" />

                <l1 enabled="false" />

            </clustering>

            <locking lockAcquisitionTimeout="20000" writeSkewCheck="false" concurrencyLevel="5000" useLockStriping="false" />

        </namedCache>

    <namedCache name="LuceneIndexesMetadata">

            <clustering mode="replication">

                <stateRetrieval fetchInMemoryState="true" logFlushTimeout="300000" />

                <sync replTimeout="50000" />

                <l1 enabled="false" />

            </clustering>

            <locking lockAcquisitionTimeout="20000" writeSkewCheck="false" concurrencyLevel="5000" useLockStriping="false" />

          <loaders shared="true" preload="true">

             <loader class="org.infinispan.loaders.jdbc.stringbased.JdbcStringBasedCacheStore" fetchPersistentState="true" ignoreModifications="false" purgeOnStartup="false">

                <properties>

                   <property name="key2StringMapperClass" value="org.infinispan.lucene.LuceneKey2StringMapper" />

                   <property name="createTableOnStart" value="true" />

                   <property name="datasourceJndiLocation" value="java:/MyDatasource" />

                   <property name="connectionFactoryClass" value="org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory" />

                   <property name="dataColumnType" value="BLOB" />

                   <property name="idColumnType" value="VARCHAR(256)" />

                   <property name="idColumnName" value="idCol" />

                   <property name="dataColumnName" value="dataCol" />

                   <property name="stringsTableNamePrefix" value="LuceneIndexesMetadata" />

                   <property name="timestampColumnName" value="timestampCol" />

                   <property name="timestampColumnType" value="BIGINT" />

                </properties>

                <async enabled="true" flushLockTimeout="2500" shutdownTimeout="7200" threadPoolSize="5" />

             </loader>

          </loaders>

       </namedCache>

       <namedCache name="LuceneIndexesData">

            <clustering mode="replication">

                <stateRetrieval fetchInMemoryState="true" logFlushTimeout="300000" />

                <sync replTimeout="50000" />

                <l1 enabled="false" />

            </clustering>

            <locking lockAcquisitionTimeout="20000" writeSkewCheck="false" concurrencyLevel="5000" useLockStriping="false" />

       <loaders shared="true" preload="true" >

             <loader class="org.infinispan.loaders.jdbc.stringbased.JdbcStringBasedCacheStore" fetchPersistentState="true" ignoreModifications="false"  purgeOnStartup="false">

                <properties>

                   <property name="key2StringMapperClass" value="org.infinispan.lucene.LuceneKey2StringMapper" />

                   <property name="createTableOnStart" value="true" />

                   <property name="datasourceJndiLocation" value="java:/MyDatasource" />

                   <property name="connectionFactoryClass" value="org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory" />

                   <property name="dataColumnType" value="BLOB" />

                   <property name="idColumnType" value="VARCHAR(256)" />

                   <property name="idColumnName" value="idCol" />

                   <property name="dataColumnName" value="dataCol" />

                   <property name="stringsTableNamePrefix" value="LuceneIndexesData" />

                   <property name="timestampColumnName" value="timestampCol" />

                   <property name="timestampColumnType" value="BIGINT" />

                </properties>

                <async enabled="true" flushLockTimeout="2500" shutdownTimeout="7200" threadPoolSize="5" />

             </loader>

          </loaders>

          <eviction maxEntries="8000" strategy="LIRS" wakeUpInterval="18000000" />  

          <expiration maxIdle="-1" />

       </namedCache>

    </infinispan>

     

    Backend node, persistence.xml:

     

    <?xml version="1.0" encoding="UTF-8"?>

    <persistence xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd" version="1.0">        

       <persistence-unit name="BACKEND" transaction-type="JTA">

          <provider>org.hibernate.ejb.HibernatePersistence</provider>

          <jta-data-source>java:/MyDatasource</jta-data-source>

          <properties>

             <property name="hibernate.dialect" value="org.hibernate.dialect.MySQLDialect"/>

             <property name="hibernate.hbm2ddl.auto" value="update"/>

             <property name="hibernate.show_sql" value="false"/>

             <property name="hibernate.format_sql" value="false"/>

             <property name="hibernate.transaction.manager_lookup_class" value="org.hibernate.transaction.JBossTransactionManagerLookup"/>

                         <property name="hibernate.cache.provider_class" value="net.sf.ehcache.hibernate.EhCacheProvider"/>

                         <property name="hibernate.cache.use_query_cache" value="true"/>

                         <property name="hibernate.cache.use_second_level_cache" value="true"/>

                         <property name="hibernate.generate_statistics" value="true"/>

                         <property name="hibernate.cache.use_structured_entries" value="true"/>

                         <property name="hibernate.cache.provider_configuration_file_resource_path" value="/ehcache.xml" />

                         <property name="net.sf.ehcache.configurationResourceName" value="ehcache.xml"/> 

                         <property name="hibernate.search.default.directory_provider" value="infinispan"/>

                         <property name="hibernate.search.default.chunk_size" value="65000"/>

                         <property name="hibernate.search.infinispan.cachemanager_jndiname" value="java:indexLucene"/> 

                         <property name="hibernate.search.default.exclusive_index_use" value="true"/>

                         <property name="hibernate.search.default.optimizer.transaction_limit.max" value="100"/>

                         <property name="hibernate.search.default.optimizer.operation_limit.max" value = "500"/>                  

              </properties>

    </persistence-unit>

    </persistence>

     

     

    Cluster frontend nodes (3 nodes), read-only search application

     

    infinispan.xml differences:

     

    <namedCache name="LuceneIndexesMetadata">

            <clustering mode="replication">

                <stateRetrieval fetchInMemoryState="true" logFlushTimeout="300000" />

                <sync replTimeout="50000" />

                <l1 enabled="false" />

            </clustering>

            <locking lockAcquisitionTimeout="20000" writeSkewCheck="false" concurrencyLevel="5000" useLockStriping="false" />

       </namedCache>

        <namedCache name="LuceneIndexesData">

           <clustering mode="replication">

                <stateRetrieval fetchInMemoryState="false" logFlushTimeout="300000" />

                <sync replTimeout="50000" />

                <l1 enabled="false" />

            </clustering>

              <loaders shared="true" preload="true" >

             <loader class="org.infinispan.loaders.jdbc.stringbased.JdbcStringBasedCacheStore"  fetchPersistentState="true" ignoreModifications="true" purgeOnStartup="false">

                <properties>

                   <property name="key2StringMapperClass" value="org.infinispan.lucene.LuceneKey2StringMapper" />

                   <property name="createTableOnStart" value="true" />

                   <property name="datasourceJndiLocation" value="java:/MyDatasource" />

                   <property name="connectionFactoryClass" value="org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory" />

                   <property name="dataColumnType" value="BLOB" />

                   <property name="idColumnType" value="VARCHAR(256)" />

                   <property name="idColumnName" value="idCol" />

                   <property name="dataColumnName" value="dataCol" />

                   <property name="stringsTableNamePrefix" value="LuceneIndexesData" />

                   <property name="timestampColumnName" value="timestampCol" />

                   <property name="timestampColumnType" value="BIGINT" />

                </properties>

                <async enabled="true" flushLockTimeout="2500" shutdownTimeout="7200" threadPoolSize="5" />

             </loader>

          </loaders>

            <locking lockAcquisitionTimeout="20000" writeSkewCheck="false" concurrencyLevel="5000" useLockStriping="false" />

              <eviction maxEntries="8000" strategy="LIRS" wakeUpInterval="1800000" />

            <expiration maxIdle="-1" />

    </namedCache>

     

    persistence.xml differences:

     

    <persistence-unit name="FRONTEND" transaction-type="JTA">

          <provider>org.hibernate.ejb.HibernatePersistence</provider>

          <jta-data-source>java:/MyDatasource</jta-data-source>

           <properties>

             <property name="hibernate.dialect" value="org.hibernate.dialect.MySQLDialect"/>

             <property name="hibernate.hbm2ddl.auto" value="update"/>

             <property name="hibernate.show_sql" value="false"/>

             <property name="hibernate.format_sql" value="true"/>

             <property name="hibernate.transaction.manager_lookup_class" value="org.hibernate.transaction.JBossTransactionManagerLookup"/>

             <property name="hibernate.cache.provider_class" value="net.sf.ehcache.hibernate.EhCacheProvider"/>

             <property name="hibernate.cache.use_query_cache" value="true"/>

                   <property name="hibernate.cache.use_second_level_cache" value="true"/>

                   <property name="hibernate.generate_statistics" value="true"/>

                   <property name="hibernate.cache.use_structured_entries" value="true"/>

                   <property name="hibernate.cache.provider_configuration_file_resource_path" value="/ehcache.xml" />            

                   <property name="net.sf.ehcache.configurationResourceName" value="ehcache.xml"/>

                   <property name="hibernate.search.worker.backend" value="jgroupsSlave"/>

                   <property name="hibernate.search.default.exclusive_index_use" value="true"/>

                   <property name="hibernate.search.default.optimizer.operation_limit.max" value="1000"/>        

             <property name="hibernate.search.default.directory_provider" value="infinispan"/>

             <property name="hibernate.search.default.chunk_size" value="65000"/>

             <property name="hibernate.search.infinispan.cachemanager_jndiname" value="java:indexLucene"/>

          </properties>

       </persistence-unit>

     

     

    Thanks for your interest!

  • 9. Re: Expiration and Lucene Directory
    Sanne Grinovero Master

    Thanks. One more question: which versions of Hibernate Search and  Infinispan?

  • 10. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    HSearch 3.4.1.FINAL and Infinispan 4.2.1-Final

  • 11. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    Looking at this property which is set in both backend and frontend nodes:

     

         <property name="hibernate.search.default.exclusive_index_use" value="true"/>

     

    Can I use exclusive index use given my architecture (a backend "writer" and frontend "readers")?

  • 12. Re: Expiration and Lucene Directory
    Sanne Grinovero Master

    Yes that works fine as long as you have a single node writing.

     

    I might have already asked you, but is there any way you could try Search 4.1 and Infinispan 5.1?

  • 13. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    Unfortunately not, as the upgrade matrix is too complex and we do not have the resources for that kind of project right now.

     

    My project is built on Seam 2 and RichFaces 3, and my application server is JBoss AS 4.2.

     

    I would like to create a batch process to purge and rebuild the indexes every night, but I need to find a way to "compact" or clean the cachestore before the reindex/optimization process.

  • 14. Re: Expiration and Lucene Directory
    Horacio Vico Newbie

    Maybe this information sheds some light:

     

    Yesterday I ran a full index rebuild. My cachestore table had approximately 6,000 rows.

    Today, after some usage, it has 18,000 rows.

     

    Querying the index table with SQL and looking at the "idCol" column, I noticed that 11,000 of those rows are Lucene "prx" files:

     

    http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/fileformats.html#Positions

     

    When I run a FullTextSession.getSearchFactory().optimize() those "files" remain untouched.

     

    Hope this helps!
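
The breakdown described above can be automated once the "idCol" values have been exported. The sketch below is illustrative only: `StoreKeyBreakdown` is a hypothetical name, and it assumes each key starts with the Lucene file name followed by a `|` separator, which should be checked against the actual strings in the table. It also computes an upper bound on the bytes held by a set of rows, assuming each row stores at most one chunk of the configured `chunk_size` (65000 in the posted configuration) and ignoring marshalling overhead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups cache-store keys by Lucene file extension.
class StoreKeyBreakdown {

    // Assumption: the key starts with the Lucene file name, followed by '|'.
    static String extensionOf(String key) {
        int bar = key.indexOf('|');
        String fileName = bar >= 0 ? key.substring(0, bar) : key;
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 ? fileName.substring(dot + 1) : "(none)";
    }

    // Counts how many keys belong to each file extension.
    static Map<String, Integer> countByExtension(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(extensionOf(key), 1, Integer::sum);
        }
        return counts;
    }

    // Upper bound on stored bytes if every row holds one full chunk.
    static long maxBytes(long rows, long chunkSizeBytes) {
        return rows * chunkSizeBytes;
    }

    public static void main(String[] args) {
        // Made-up keys in the assumed format, not real table contents.
        List<String> keys = Arrays.asList(
                "_a.prx|0|com.example.Doc",
                "_a.prx|1|com.example.Doc",
                "_a.frq|0|com.example.Doc",
                "segments_2|0|com.example.Doc");
        System.out.println(countByExtension(keys));
        // 11,000 "prx" rows at a chunk size of 65000 bytes.
        System.out.println("prx upper bound (bytes): " + maxBytes(11000, 65000));
    }
}
```

Under these assumptions, 11,000 "prx" rows account for at most 715,000,000 bytes, so they would explain a large share of the growth seen between rebuilds.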
