Version 3

    This is a local, file-based cache store optimized for write-through use with strong consistency guarantees (the ability to flush writes to disk before returning from the store call).

     

    There are three threads operating in the cache store:

     

    * LogAppender: Requests to store entries are passed to the LogAppender thread via a queue; the requestor threads then wait until the LogAppender notifies them of a successful store (the waiting should be implemented as busy-waiting over a volatile counter, with yielding - see the hand-off sketch after this list). The LogAppender serializes the writes into an append-only file, records the entry's offset in the TemporaryTable and enqueues a request to update the Index into the UpdateQueue. The append-only files have a limited size; when a file is full, a new file is started.

    * IndexUpdater: Reads the UpdateQueue, applies the operation to the B-tree-like structure Index (exact description below) and then removes the entry from the TemporaryTable. When an entry in the Index is overwritten, the offset of the previous entry is retrieved and the IndexUpdater increases the unused-space statistics for that entry's file in FileStats.

    * Compactor: When the limit of unused space in some file is reached (according to FileStats), the Compactor starts reading that file sequentially, querying the TemporaryTable or Index for the current entry position and copying the still-valid entries into another file. For each entry that is still valid in the original file, a compare-and-set (file-offset based) request is enqueued into the UpdateQueue - therefore this operation cannot interfere with concurrent writes that overwrite the entry. Multiple files can be merged into a single file during compaction.
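    Below is a minimal sketch (in Java) of the requestor/LogAppender hand-off described above. All names (LogAppenderHandoff, StoreRequest, the queue capacity) are illustrative assumptions; the design only prescribes the queue plus busy-waiting over a volatile counter with yielding.

        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;

        final class LogAppenderHandoff {

            static final class StoreRequest {
                final byte[] serializedEntry;
                final long ticket;                        // sequence number assigned on submission
                StoreRequest(byte[] serializedEntry, long ticket) {
                    this.serializedEntry = serializedEntry;
                    this.ticket = ticket;
                }
            }

            private final BlockingQueue<StoreRequest> appendQueue = new ArrayBlockingQueue<>(1024);
            private long nextTicket;                      // guarded by synchronized (appendQueue)
            // Volatile counter advanced by the LogAppender once a store (including the
            // flush, for write-through mode) has completed; requestors busy-wait on it.
            private volatile long completed = 0;

            // Requestor side: enqueue the request, then busy-wait with yielding.
            // Ticket assignment and enqueue happen atomically so ticket order matches
            // append (FIFO) order; otherwise a requestor could be released before its
            // own entry was actually written.
            void store(byte[] serializedEntry) throws InterruptedException {
                long ticket;
                synchronized (appendQueue) {
                    ticket = ++nextTicket;
                    appendQueue.put(new StoreRequest(serializedEntry, ticket));
                }
                while (completed < ticket) {
                    Thread.yield();
                }
            }

            // The LogAppender thread consumes requests in FIFO order.
            StoreRequest take() throws InterruptedException {
                return appendQueue.take();
            }

            // LogAppender side: called after the entry was appended and flushed.
            // Only the LogAppender thread writes the counter, so a plain assignment
            // of the (monotonically increasing) ticket is sufficient.
            void confirm(StoreRequest request) {
                completed = request.ticket;
            }
        }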

     

    Structures:

    * TemporaryTable: keeps records about the current entry location until the change is applied to the Index. Each read request goes to the TemporaryTable first; if the key is not found there, the Index is queried (see the read-path sketch after this list).

    * UpdateQueue: a bounded queue (bounded to prevent the TemporaryTable from growing too much) of either forced writes (used for regular stores) or compare-and-set writes (used by the Compactor).

    * FileStats: a simple (concurrent) hash table with the actual file size and the amount of unused space for each file.

    * Index: a B+-tree of IndexNodes. The tree is dropped and built anew if the process crashes, so it does not need to flush disk operations. On disk it is kept as a single randomly accessed file, with the free-block list stored in memory.
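    The read path implied by the TemporaryTable and Index descriptions above can be sketched as follows; the types and method names (TemporaryTable.get, Index.lookup, EntryPosition) are assumptions for illustration, not part of the design.

        interface TemporaryTable { EntryPosition get(Object key); }
        interface Index { EntryPosition lookup(Object key); }
        record EntryPosition(int fileId, long offset) { }

        final class ReadPath {
            private final TemporaryTable temporaryTable;
            private final Index index;

            ReadPath(TemporaryTable temporaryTable, Index index) {
                this.temporaryTable = temporaryTable;
                this.index = index;
            }

            // Reads consult the TemporaryTable first (locations not yet applied to
            // the Index) and fall back to the Index only on a miss.
            EntryPosition locate(Object key) {
                EntryPosition position = temporaryTable.get(key);
                if (position != null) {
                    return position;
                }
                return index.lookup(key);
            }
        }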

     

    Note: as the IndexUpdater may easily become a bottleneck under heavy load, the IndexUpdater thread, the UpdateQueue and the tree of IndexNodes may be instantiated several times - we call this splitting the Index into Segments. The keys are split among the Segments according to the hashCode() of the key (see the sketch below).
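    A minimal sketch of the key-to-Segment mapping; the names (Segment, Segments, numSegments) are illustrative, only the use of hashCode() comes from the design above.

        final class Segment {
            // Each Segment owns its own UpdateQueue, IndexUpdater thread and IndexNode tree.
        }

        final class Segments {
            private final Segment[] segments;

            Segments(int numSegments) {
                segments = new Segment[numSegments];
                for (int i = 0; i < numSegments; i++) {
                    segments[i] = new Segment();
                }
            }

            Segment segmentFor(Object key) {
                // Mask out the sign bit so the modulo result is never negative.
                int h = key.hashCode() & Integer.MAX_VALUE;
                return segments[h % segments.length];
            }
        }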

     

    The number of entries in an IndexNode is limited by the size it occupies on disk. This size is limited by the configurable NODE_SIZE (4096 bytes by default?); only when the node contains a single pivot that is too long may it be larger. A key_prefix common to all keys in the IndexNode is stored in order to reduce space requirements. For implementation reasons the keys are limited to 32kB - this limit may be lifted later. The pivots are not whole keys - a pivot is the shortest part of a key that is greater than all keys in the left child (but less than or equal to all keys in the right child) - let us call this a key_part. The key_parts are naturally kept sorted in the IndexNode. On disk the node has this format:

     

    key_prefix_length(2 bytes), key_prefix, num_parts(2 bytes), ( key_part_length (2 bytes), key_part, left_child_index_node_offset (8 bytes))+, right_child_index_node_offset (8 bytes).
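    A sketch of writing one IndexNode in the layout above, assuming Java and illustrative field names (keyPrefix, keyParts, childOffsets); the byte widths follow the format exactly.

        import java.io.DataOutputStream;
        import java.io.IOException;

        final class IndexNodeWriter {
            // childOffsets.length must equal keyParts.length + 1 (the extra right child).
            static void write(DataOutputStream out, byte[] keyPrefix,
                              byte[][] keyParts, long[] childOffsets) throws IOException {
                out.writeShort(keyPrefix.length);             // key_prefix_length (2 bytes)
                out.write(keyPrefix);                         // key_prefix
                out.writeShort(keyParts.length);              // num_parts (2 bytes)
                for (int i = 0; i < keyParts.length; i++) {
                    out.writeShort(keyParts[i].length);       // key_part_length (2 bytes)
                    out.write(keyParts[i]);                   // key_part
                    out.writeLong(childOffsets[i]);           // left_child_index_node_offset (8 bytes)
                }
                out.writeLong(childOffsets[keyParts.length]); // right_child_index_node_offset (8 bytes)
            }
        }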

     

    In memory, a SoftReference<IndexNode> is held for every child. When this reference is empty (but the offset in the file is set), any reader may load the node using the double-checked locking pattern (synchronized over the reference itself). Note that as the referent in a SoftReference is not volatile, the actual load and the setting of the SoftReference must be separated by a write to a volatile field - the IndexNode may hold a volatile reference to itself (synchronization piggybacking). This way the node is never loaded by multiple threads in parallel, and a load in progress may even block other threads trying to read this node. A sketch follows below.
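    A sketch of that child loading, with assumed names (Child, IndexFile.readNode, the self field). The design prescribes the double-checked locking and the volatile publication; for simplicity this sketch synchronizes on the holder object rather than on the SoftReference itself, since the SoftReference is replaced after a load.

        import java.io.IOException;
        import java.lang.ref.SoftReference;

        final class IndexNode {
            // Volatile self-reference used only for safe publication: the loader writes
            // it after the node is fully read; readers check it before using the node.
            volatile IndexNode self;
            // ... key_prefix, key_parts, child offsets and references ...
        }

        interface IndexFile {
            IndexNode readNode(long offset) throws IOException;
        }

        final class Child {
            private final long fileOffset;                 // node position in the index file
            private SoftReference<IndexNode> reference = new SoftReference<>(null);

            Child(long fileOffset) {
                this.fileOffset = fileOffset;
            }

            IndexNode getOrLoad(IndexFile file) throws IOException {
                IndexNode node = reference.get();
                if (node != null && node.self != null) {   // volatile read completes the publication
                    return node;
                }
                synchronized (this) {                      // only one thread loads the node
                    node = reference.get();                // re-check under the lock
                    if (node != null) {
                        return node;
                    }
                    node = file.readNode(fileOffset);
                    node.self = node;                      // volatile write before exposing the node
                    reference = new SoftReference<>(node);
                    return node;
                }
            }
        }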

    For each node in memory an RW-lock is held. When the IndexUpdater thread updates the Index (modifying some IndexNodes), it prepares copies of these nodes (properly linked, with strong references kept in a collection). Then, in top-down direction, it locks the parent for writing, substitutes it with the new data, writes that data into the file and locks all updated children - only then may it release the lock on the parent. Threads other than the IndexUpdater, naturally, lock the nodes only for reading. A sketch of this top-down locking follows below.
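    A sketch of the top-down locking, with assumed names (NodeUpdate, replaceContentWith, writeToFile); here IndexNode only carries the RW-lock and is independent of the earlier sketches.

        import java.util.List;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        final class IndexNode {
            final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
            void replaceContentWith(IndexNode replacement) { /* swap keys, children, offsets */ }
            void writeToFile() { /* serialize in the on-disk format shown earlier */ }
        }

        final class NodeUpdate {
            final IndexNode node;            // existing in-memory node to be replaced
            final IndexNode replacement;     // freshly prepared copy, linked to the new children
            final List<NodeUpdate> children; // updates for modified children (may be empty)

            NodeUpdate(IndexNode node, IndexNode replacement, List<NodeUpdate> children) {
                this.node = node;
                this.replacement = replacement;
                this.children = children;
            }
        }

        final class IndexUpdaterLocking {
            // Entry point: write-lock the topmost modified node, then walk down.
            void apply(NodeUpdate root) {
                root.node.lock.writeLock().lock();
                applyLocked(root);
            }

            // Pre-condition: update.node is already write-locked by this thread.
            private void applyLocked(NodeUpdate update) {
                try {
                    update.node.replaceContentWith(update.replacement); // substitute in memory
                    update.node.writeToFile();                          // persist the new node
                    for (NodeUpdate child : update.children) {
                        child.node.lock.writeLock().lock();             // lock children before releasing the parent
                    }
                } finally {
                    update.node.lock.writeLock().unlock();              // readers may access the parent again
                }
                for (NodeUpdate child : update.children) {
                    applyLocked(child);                                 // continue top-down
                }
            }
        }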