Is Infinispan a good fit for this use case?
john.sanda Jul 28, 2015 10:00 AM
I work on Hawkular Metrics (H-Metrics), which is basically a time series database built on top of Cassandra. There are different situations in which we generate new time series from one or more existing time series. This will be done as one or more background jobs that run periodically. For example, H-Metrics defines a counter metric type. We automatically compute and store the rate for counters as new time series. This is done in a background job that runs on a pre-configured interval, maybe every minute. In the worst case, we have to perform N reads, where N is the number of counter metrics. Cassandra is very scalable, so we can add (Cassandra) nodes as necessary to deal with the read load, but this does not reduce the amount of I/O we have to do. We can do some schema optimizations that will reduce the overall number of reads. That might be an adequate approach, but I am curious whether Infinispan would be a good fit in this situation.
Here is one approach I have considered. I maintain a separate cache per minute since I want to calculate/update rates every minute. When I receive a request to store counter data points, I write the data points both to Cassandra and to the cache for the current minute. The cache entry key would be the metric id, and the value would be the metric data point value. In this particular case I only need to store the latest value for the current minute, so I can simply replace existing cache entries. The background job that runs every minute would no longer query Cassandra for counter data points. It might instead kick off a DistributedExecutorService task that calculates the rates for each of the cache entries, writes them to Cassandra, and then finally deletes the cache. In this example, I simply inserted (or replaced) cache entries. For other metric data types, I will need to update existing cache entries.
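To make the idea concrete, here is a minimal, single-JVM sketch of the per-minute cache. It is only an illustration under several assumptions: Infinispan's `Cache` interface extends `java.util.concurrent.ConcurrentMap`, so a plain `ConcurrentHashMap` stands in for each cache; the class `MinuteCacheSketch`, the `DataPoint` type, and the method names are hypothetical; and the rate is assumed to be the per-minute delta between the latest value of this minute and the latest value previously seen, which is not necessarily how H-Metrics defines it. The Cassandra writes and the DistributedExecutorService are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class MinuteCacheSketch {
    /** Hypothetical value type: one counter data point. */
    public static final class DataPoint {
        final long timestamp; // epoch millis
        final double value;
        DataPoint(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // One map per minute bucket (stand-in for one Infinispan cache per minute).
    private final ConcurrentMap<Long, ConcurrentMap<String, DataPoint>> caches =
            new ConcurrentHashMap<>();
    // Assumption: the last value from the previous run is kept to compute a delta.
    private final Map<String, DataPoint> previous = new ConcurrentHashMap<>();

    /** Start of the minute (in millis) that contains the given timestamp. */
    static long minuteOf(long timestamp) {
        return timestamp - Math.floorMod(timestamp, 60_000L);
    }

    /** Write path: keep only the latest point per metric in the minute's cache. */
    public void store(String metricId, long timestamp, double value) {
        // The write to Cassandra would also happen here.
        caches.computeIfAbsent(minuteOf(timestamp), m -> new ConcurrentHashMap<>())
              .merge(metricId, new DataPoint(timestamp, value),
                     // A late-arriving older point must not replace a newer one.
                     (old, neu) -> neu.timestamp >= old.timestamp ? neu : old);
    }

    /** Background job: compute per-minute rates for one bucket, then drop the bucket. */
    public Map<String, Double> computeRates(long minute) {
        Map<String, Double> rates = new HashMap<>();
        ConcurrentMap<String, DataPoint> bucket = caches.remove(minuteOf(minute));
        if (bucket == null) {
            return rates;
        }
        bucket.forEach((id, latest) -> {
            DataPoint prev = previous.put(id, latest);
            if (prev != null && latest.timestamp > prev.timestamp) {
                double perMinute = (latest.value - prev.value) * 60_000.0
                        / (latest.timestamp - prev.timestamp);
                rates.put(id, perMinute); // would be written to Cassandra
            }
        });
        return rates;
    }
}
```

Using `merge` rather than a blind `put` is a design choice for the "latest value wins" case: it keeps the replacement atomic per key and tolerates out-of-order writes. Whether the equivalent operation is cheap enough on a distributed cache is exactly the open question.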
The appeal of this approach is that I completely eliminate having to query Cassandra when calculating new time series data, but I do not have enough experience with Infinispan to know whether or not this would be a sound approach. Here are some additional questions I am considering.
Will the performance of lots of concurrent reads/writes on a cache be adequate, such that it won't become a bottleneck when storing data points? Currently, the write path for storing data points only involves writing to Cassandra, which is very fast. With a caching layer, it could also involve writing new cache entries and/or updating existing cache entries. For a local cache where all entries are in memory, I would expect cache operations to be extremely fast, as I am performing simple updates on the entries.
What about when the cache is distributed? For the moment, assume I am running multiple H-Metrics servers with embedded, distributed caches. If server-1 receives a request to store data and the cache entry for that data resides on server-2, do I need to be concerned about the latency of performing cache updates?
I previously mentioned using a separate cache for each minute. A new cache is created whenever I receive a metric data point having a timestamp for the next minute. Consider again a distributed cache with multiple H-Metrics servers. What if two H-Metrics servers both try to create the cache for the next minute at the same time? Do I need to be concerned about race conditions? Is cache creation fast? I am curious about how fast it is as the number of ISPN nodes increases.
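The race in question can be illustrated within a single JVM. The sketch below is not Infinispan code: `MinuteCacheRegistry`, `cacheName`, and the `counters-` prefix are hypothetical, and a `ConcurrentHashMap` stands in for the cache manager. It shows the two properties I would want from the real thing: every caller derives the same cache name from a timestamp, and concurrent "get or create" calls for the same minute result in exactly one creation. Whether and how cheaply this holds across a cluster is the part I do not know.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class MinuteCacheRegistry {
    // Stand-in for the cache manager: cache name -> cache.
    private final ConcurrentMap<String, ConcurrentMap<String, Double>> caches =
            new ConcurrentHashMap<>();
    private final AtomicInteger creations = new AtomicInteger();

    /** Hypothetical naming scheme: the name depends only on the minute of the timestamp. */
    public static String cacheName(long timestampMillis) {
        return "counters-" + Instant.ofEpochMilli(timestampMillis)
                                    .truncatedTo(ChronoUnit.MINUTES);
    }

    /** Returns the cache for the timestamp's minute, creating it at most once. */
    public ConcurrentMap<String, Double> getOrCreate(long timestampMillis) {
        return caches.computeIfAbsent(cacheName(timestampMillis), name -> {
            creations.incrementAndGet(); // counts how often creation actually ran
            return new ConcurrentHashMap<>();
        });
    }

    /** Number of caches actually created so far. */
    public int creationCount() {
        return creations.get();
    }
}
```

Calling `getOrCreate` from many threads with timestamps in the same minute leaves `creationCount()` at 1, because `ConcurrentHashMap.computeIfAbsent` runs the creation function atomically per key.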
When nodes are added/removed, what kind of performance impact is there from data being reshuffled? Can I continue to serve client requests while a node is added/removed, or should this be done as an offline maintenance operation?
So far I have only discussed using embedded caches. What are the pros/cons of using the HotRod server instead?