New cluster architecture

Hello community!


I need to set up a new ecosystem to store and process (ETL) customers' data.

I estimate roughly 100 million records per day for the average customer, with each record being 100-500 bytes. The data does not need to be stored for long. The target is around 100 customers.

If my math is right, I'll need to ingest 1.5 TB per customer per month (500 bytes, the worst case, * 100 million records = 50 GB per day, * 30 days = 1.5 TB). If we get close to 100 customers, the total should be about 150 TB of data per month. Initially we ingest the data as is, run it through some ETL processes, and store some kind of summarized data in a NoSQL database; after this process the original data must be deleted.
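As a quick sanity check on the sizing, here is a small sketch of the arithmetic. It uses only the estimates stated in this post (100 million records per day, 500 bytes worst case, 30 days, 100 customers). The decimal units (1 GB = 10^9 bytes) and the variable names are my own assumptions.

```python
# Back-of-the-envelope ingest sizing. All inputs are the estimates from this post.
RECORDS_PER_DAY = 100_000_000  # per customer
BYTES_PER_RECORD = 500         # worst case
DAYS_PER_MONTH = 30
CUSTOMERS = 100

# Decimal units (assumption): 1 GB = 10^9 bytes, 1 TB = 10^12 bytes.
GB = 10**9
TB = 10**12

daily = RECORDS_PER_DAY * BYTES_PER_RECORD      # bytes per customer per day
per_customer_month = daily * DAYS_PER_MONTH     # bytes per customer per month
total_month = per_customer_month * CUSTOMERS    # bytes per month, all customers

print(f"per customer per day:   {daily / GB:.0f} GB")
print(f"per customer per month: {per_customer_month / TB:.1f} TB")
print(f"all customers, monthly: {total_month / TB:.0f} TB")
```

Since the raw data is deleted after the ETL step, the storage actually held at any one time should be well below the monthly ingest volume.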

The stored data will then be mined to create relationships in a graph database and to populate a relational database (such as Postgres).


I've been thinking about several possible solutions. Some of them are:


- HBase + Hive/Impala

- HBase + Spark

- HBase + MapReduce (pretty much the same as HBase + Hive, but more complex)

- Spark standalone + S3 + NoSQL database (without Hadoop)


If you can think of something better or more performant, please let me know.