Support Questions

TimothySpann · ‎07-20-2016

I am looking for the best option for in-memory computing on fast data. We need the fastest possible access to our most recent data (current, last 5 minutes, last hour, less than 1 day).

It's probably 500 GB or less.

Something like Pivotal's Butterfly Architecture.

What will work best for keeping some of this fast data? I have been looking at Apache Geode, Apache Ignite, Alluxio, SnappyData, Redis, HDFS Ram Data Nodes, HBase In-Memory Column Families, Kafka, Spark Streaming.

Any baked solutions out there that work with HDP?

egarelnabi · ‎08-15-2016

Hi @Timothy Spann

It really all depends on your particular use case and requirements. First, I'm assuming you have a custom-built application that will be querying this data store. If so, how complex do the queries need to be? Do you need Relational (SQL) or Key-Value store? Also, how much latency can you afford?

I would first explore if HBase (or HBase + Phoenix) would be sufficient. This will reduce the number of moving parts you have.

If you're set on in-memory data grids/stores, then some options would be Redis, Hazelcast, Terracotta BigMemory, and GridGain (Apache Ignite). I believe the last two have connectors to Hadoop that allow writing the results of MR jobs directly to the data grid (you'll need to confirm that functionality, though).

Like I said before though, I recommend you exhaust the HBase option before moving out-of-stack.
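To make the key-value and latency questions above more concrete, here is a minimal Python sketch of the access pattern described in the question: keep only data younger than one day in memory and read back the last 5 minutes or the last hour. The class `RecentWindowStore` and its methods are hypothetical names invented for this illustration; it is a single-process toy that assumes events arrive in timestamp order, not a stand-in for HBase, Redis, or any of the data grids mentioned.

```python
import time
from collections import deque


class RecentWindowStore:
    """Hypothetical in-memory store that keeps only events newer than max_age_s.

    Assumption: events are appended in non-decreasing timestamp order,
    so the oldest event is always at the left end of the deque.
    """

    def __init__(self, max_age_s=24 * 3600, clock=time.time):
        self.max_age_s = max_age_s
        self.clock = clock          # injectable so the window logic can be tested
        self._events = deque()      # (timestamp, key, value), oldest first

    def put(self, key, value):
        """Record an event at the current time and drop anything too old."""
        self._events.append((self.clock(), key, value))
        self._evict()

    def _evict(self):
        """Remove events older than the retention window (e.g. 1 day)."""
        cutoff = self.clock() - self.max_age_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def since(self, seconds):
        """Return (key, value) pairs recorded in the last `seconds` seconds."""
        self._evict()
        cutoff = self.clock() - seconds
        return [(k, v) for ts, k, v in self._events if ts >= cutoff]
```

A caller would use `store.since(300)` for the last 5 minutes and `store.since(3600)` for the last hour. If lookups like these cover your needs, a key-value store is probably enough; if you need joins or ad hoc filters, that points toward the SQL side (for example HBase + Phoenix).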


Cloudera Community

In-Memory Layer
