FileCacheStore redesign

This is an early design doc to redesign Infinispan's FileCacheStore.

 

Some general ideas:

 

  • B+Tree-based: good for fast lookup (reading), but slower for writing.
  • Append-only store
    • Fast writing, slow reading.
    • Useful if the data set is held in memory and write-through is purely for resilience (not expanded capacity).
    • Would require a separate thread/process to handle compaction (back into a B+Tree). A minimal sketch of the append-only idea follows below.
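
To make the append-only idea concrete, here is a minimal, hypothetical sketch (plain Java with String keys and byte[] values; not Infinispan code): every write appends a record to the tail of the log and updates an in-memory key-to-offset index, and reads seek directly to the recorded offset. Compaction and crash recovery are deliberately left out.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Append-only log with an in-memory key -> offset index. Writes only ever
    // go to the tail; reads seek to the recorded offset. A compaction pass
    // would periodically rewrite the log, dropping shadowed records.
    public class AppendOnlyStore implements AutoCloseable {
        private final RandomAccessFile log;
        private final Map<String, Long> index = new ConcurrentHashMap<>();

        public AppendOnlyStore(String path) throws IOException {
            log = new RandomAccessFile(path, "rw");
            log.seek(log.length()); // always append at the tail
        }

        public synchronized void store(String key, byte[] value) throws IOException {
            long offset = log.getFilePointer();
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            log.writeInt(k.length);      // record layout: keyLen, key, valueLen, value
            log.write(k);
            log.writeInt(value.length);
            log.write(value);
            index.put(key, offset);      // a newer record shadows older ones
        }

        public synchronized byte[] load(String key) throws IOException {
            Long offset = index.get(key);
            if (offset == null) return null;
            log.seek(offset);
            byte[] k = new byte[log.readInt()];
            log.readFully(k);
            byte[] v = new byte[log.readInt()];
            log.readFully(v);
            log.seek(log.length());      // restore the append position
            return v;
        }

        @Override
        public synchronized void close() throws IOException {
            log.close();
        }
    }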

 

Some good background reading:

 

https://www.kernel.org/pub/linux/kernel/people/suparna/aio/262/results/aio-stress-results.txt

http://www.acunu.com/2/post/2011/03/why-is-acunu-in-kernel.html

http://www.datastax.com/dev/blog/what-persistence-and-why-does-it-matter

http://www.datastax.com/dev/blog/cassandra-file-system-design

http://wiki.apache.org/cassandra/Durability

http://wiki.apache.org/cassandra/ArchitectureCommitLog

http://www.slideshare.net/rbranson/cassandra-and-solid-state-drives

http://antirez.com/post/redis-persistence-demystified.html

http://hornetq.blogspot.co.uk/2009/08/persistence-on-hornetq.html

http://hornetq.sourceforge.net/docs/hornetq-2.0.0.BETA5/user-manual/en/html/persistence.html

http://hornetq.sourceforge.net/docs/hornetq-2.0.0.GA/user-manual/en/html/libaio.html

https://code.google.com/p/leveldb/

 

Related JIRAs:

 

https://issues.jboss.org/browse/ISPN-1808

https://issues.jboss.org/browse/ISPN-1362

https://issues.jboss.org/browse/ISPN-1303

https://issues.jboss.org/browse/ISPN-1302

https://issues.jboss.org/browse/ISPN-1301

https://issues.jboss.org/browse/ISPN-517

 

Test plan:

  • Operations to test: load, store, remove, preload
  • These operations should be tested in two major scenarios:
    • Test operations on a local cache with no eviction, plugged with the file cache store (no async store), in such a way that the cache and the cache store hold exactly the same data, e.g. 1 GB of data stored. This test aims to see how fast we can update the cache store; reads would be very fast because they'd be served by the in-memory cache.
    • Test operations on a small in-memory local cache with aggressive eviction settings, plugged with a file-based cache store (no async store) that's used as overflow, e.g. keep 1 GB in memory and store 20 GB in the file store (see the toy sketch after this list). Here we're trying to get a better idea of how good the cache store is at reading data: most of the data will be present in the cache store and not in the cache, so reads require hitting the cache store and pulling that data into memory.
  • Before writing any new cache stores, we should evaluate the performance of the cache stores available right now: FileCacheStore, KarstenFileCacheStore and the LevelDB-based store (Java and JNI implementations).
  • Preferably, tests should be run on modern SSD drives.
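
As a toy illustration of the overflow scenario (not Infinispan's actual eviction/passivation mechanism), the following sketch evicts the eldest entries of a bounded in-memory map into a backing store and falls back to that store on read misses; StoreStandIn is a hypothetical interface introduced only for this example.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A bounded LRU-style map that spills evicted entries into a backing store
    // and reads through to that store on a miss.
    public class OverflowSketch {
        interface StoreStandIn {
            void store(String key, byte[] value);
            byte[] load(String key);
        }

        static class EvictingCache extends LinkedHashMap<String, byte[]> {
            private final int maxEntries;
            private final StoreStandIn store;

            EvictingCache(int maxEntries, StoreStandIn store) {
                super(16, 0.75f, true);  // access order, i.e. LRU-style eviction
                this.maxEntries = maxEntries;
                this.store = store;
            }

            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                if (size() > maxEntries) {
                    store.store(eldest.getKey(), eldest.getValue()); // overflow to disk
                    return true;
                }
                return false;
            }

            byte[] read(String key) {
                byte[] v = get(key);
                if (v == null) {
                    v = store.load(key);         // miss: read from the cache store...
                    if (v != null) put(key, v);  // ...and pull it back into memory
                }
                return v;
            }
        }
    }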

 

Objectives:

  • For each of the major scenarios, target performance objectives need to be set. TBD.

 

Current results:

All setups used a local cache; the benchmark was executed via RadarGun (actually a version not yet merged into master [2]). I've used 4 nodes just to get more data - each slave was absolutely independent of the others.

 

The first test was preloading performance - the cache started and tried to load 1 GB of data from the hard drive. Without a cache store the startup takes about 2-4 seconds; the average numbers for the cache stores are below:

 

 

Cache store           | Startup time
FileCacheStore        | 9.8 s
KarstenFileCacheStore | 14 s
LevelDB-JAVA impl.    | 12.3 s
LevelDB-JNI impl.     | 12.9 s

 

 

IMO nothing special, all times seem affordable. We don't benchmark exactly the storing of the data into the cache store, but here FileCacheStore took about 44 minutes, while Karsten took about 38 seconds, LevelDB-JAVA 4 minutes and LevelDB-JNI 96 seconds. The units are right: minutes compared to seconds. But we all know that FileCacheStore is bloody slow.

 

The second test is a stress test (5 minutes, preceded by a 2 minute warmup) where each of 10 threads works on 10k entries with 1kB values (~100 MB in total); 20% writes, 80% reads, as usual. No eviction is configured, so the cache store serves only as persistent storage for the case of a crash. A minimal sketch of the access pattern follows.
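
For reference, here is a minimal sketch of that access pattern (not the actual RadarGun benchmark; a shared ConcurrentHashMap stands in for the cache):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    // 10 threads, each touching its own 10k keys with 1kB values:
    // 20% puts, 80% gets, for 5 minutes.
    public class StressSketch {
        static final Map<String, byte[]> cache = new ConcurrentHashMap<>(); // stand-in for the cache

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(10);
            final long end = System.nanoTime() + TimeUnit.MINUTES.toNanos(5);
            for (int t = 0; t < 10; t++) {
                final int thread = t;
                pool.submit(() -> {
                    byte[] value = new byte[1024]; // 1kB value payload
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    while (System.nanoTime() < end) {
                        String key = "key-" + thread + "-" + rnd.nextInt(10_000);
                        if (rnd.nextInt(100) < 20) {
                            cache.put(key, value); // 20% writes
                        } else {
                            cache.get(key);        // 80% reads
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(6, TimeUnit.MINUTES);
        }
    }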

 

Cache store           | reads/s | writes/s   | note
FileCacheStore        | 3.1M    | 112        | on one node the performance was only 2.96M reads/s and 75 writes/s
KarstenFileCacheStore | 9.2M    | 226k       |
LevelDB-JAVA impl.    | 3.9M    | 5100       |
LevelDB-JNI impl.     | 6.6M    | 14k        | on one node the performance was 3.9M/8.3k - about half of the others
Without cache store   | 15.5M   | 4.4M       |

 

The Karsten implementation pretty much rules here, for two reasons. First, it does not flush the data (it only calls RandomAccessFile.write()). The other cheat is that it keeps in memory the keys and the offsets of the data values within the database file. That makes it definitely the best choice for this scenario, but it does not allow the cache store to scale, especially in cases where the keys are big and the values small. However, this performance boost is definitely worth pursuing - I could imagine caching the disk offsets in memory and querying a persistent index only when a record is missing, with parts of the persistent index flushed asynchronously (the index can always be rebuilt during preloading in case of a crash). A rough sketch of that idea follows.
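
A rough sketch of the hybrid lookup described above, with PersistentIndex as a hypothetical interface (not an existing Infinispan class): the in-memory map answers most offset lookups, and the persistent index is consulted only on a miss and written asynchronously.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // In-memory offset cache in front of a persistent index.
    public class OffsetCacheSketch {
        // Hypothetical on-disk index; introduced only for this example.
        interface PersistentIndex {
            Long lookup(String key);                   // read an offset from disk
            void writeAsync(String key, long offset);  // flushed asynchronously
        }

        static class OffsetCache {
            private final Map<String, Long> hot = new ConcurrentHashMap<>();
            private final PersistentIndex index;

            OffsetCache(PersistentIndex index) {
                this.index = index;
            }

            Long offsetOf(String key) {
                Long offset = hot.get(key);            // fast path: cached in memory
                if (offset != null) return offset;
                offset = index.lookup(key);            // slow path: persistent index
                if (offset != null) hot.put(key, offset);
                return offset;
            }

            void record(String key, long offset) {
                hot.put(key, offset);
                // The asynchronous flush means the on-disk index may lag behind,
                // but it can always be rebuilt during preloading after a crash.
                index.writeAsync(key, offset);
            }
        }
    }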

 

The third test targeted the scenario where there is more data to store than fits in memory - the stressors operated on 100k entries (~100 MB of data), but eviction was set to 10k entries (9216 entries ended up in memory after the test ended).

 

Cache store           | reads/s | writes/s   | note
FileCacheStore        | 750     | 285        | one node had only 524 reads/s and 213 writes/s
KarstenFileCacheStore | 458k    | 137k       |
LevelDB-JAVA impl.    | 21k     | 9k         | these values are for the mmap implementation (typo in test)
LevelDB-JNI impl.     | 13k-46k | 6.6k-15.2k | the performance varied a lot!

 

We have also tested the second and third scenarios with an increased amount of data - each thread operated on 200k entries, giving about 2 GB of data in total. The test execution was also prolonged to a 5 minute warmup and a 10 minute test. FileCacheStore was excluded from this comparison.

Update: I have also added FileChannel.force(false) calls to the Karsten implementation; those results are included below.
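
For clarity, this is the kind of call that was added (a minimal standalone example, not the actual Karsten patch): FileChannel.force(false) flushes the file's content, but not its metadata, out of the OS page cache to the storage device.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class ForcedWrite {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("store.dat", "rw")) {
                FileChannel ch = raf.getChannel();
                ch.write(ByteBuffer.wrap("entry".getBytes(StandardCharsets.UTF_8)));
                // force(false) pushes the file's data (not its metadata) to the
                // device; without it the write may sit in the page cache and be
                // lost on a crash - hence the throughput difference in the tables.
                ch.force(false);
            }
        }
    }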

 

Persistent storage scenario:

 

 

Cache store                                | reads/s   | writes/s
KarstenFileCacheStore                      | 3.8M-5.3M | 3600-7700
KarstenFileCacheStore - force(false)       | 3.2M      | 1650
LevelDB-JAVA                               | 3.8M      | 2200
LevelDB-JAVA - force(false)                | 3.2M      | 400
LevelDB-JAVA - force(false), SNAPPY (iq80) | 3.2M      | 390
LevelDB-JNI                                | 5.3M      | 4650
LevelDB-JNI - sync writes                  | 3.0M      | 1240
LevelDB-JNI - sync writes, SNAPPY          | 3.2M      | 1240
Without cache store                        | 6.2M      | 1.9M

 

Overflow scenario:

 

Cache store                                | reads/s     | writes/s    | note
KarstenFileCacheStore                      | 265k        | 16k         | one node had 21k writes/s
KarstenFileCacheStore - force(false)       | 285k        | 1200        |
LevelDB-JAVA                               | 500 or 5900 | 400 or 4000 | one node 10x faster, with a different memory and CPU usage pattern; these values are for the mmap implementation (typo in test)
LevelDB-JAVA - force(false)                | 950         | 520         |
LevelDB-JAVA - force(false), SNAPPY (iq80) | 950         | 515         |
LevelDB-JNI                                | 9200-14.4k  | 5400-6500   |
LevelDB-JNI - sync writes                  | 15.5k       | 900         | some variance between nodes
LevelDB-JNI - sync writes, SNAPPY          | 14k-19k     | 750-1100    | one node slower at writes

 

Obviously, the performance dropped radically compared to the 100 MB case.

 

Another test tried to find out the impact of value size. We used the persistent configuration, with each thread operating on 100k entries with 1kB values, 25k entries with 4kB values, or 6125 entries with 16kB values - so each thread handled roughly the same data volume in all three cases.

 

Cache store           | 1kB values                  | 4kB values                  | 16kB values
KarstenFileCacheStore | 13k writes/s (one node 22k) | 13k writes/s (one node 24k) | 12.5k writes/s (one node 19k)
LevelDB-JNI           | 6k writes/s                 | 1400 writes/s               | 400 writes/s

 

The next test used 1kB, 4kB or 16kB keys and empty values:

 

Cache store           | 1kB keys     | 4kB keys     | 16kB keys
KarstenFileCacheStore | 13k writes/s | 12k writes/s | 7k writes/s
LevelDB-JNI           | 8k writes/s  | 490 writes/s | 130 writes/s