
New cluster architecture



Hello community!


I need to set up a new ecosystem to store and process (ETL) customers' data.

I estimate roughly 100 million records per day for the average customer, with each record being 100-500 bytes. This data should not be stored for a long period of time. The plan is to have around 100 customers.

If my math is right, I'll need to ingest 1.5 TB per customer per month (500 bytes, the worst case, * 100 million records = 50 GB per day, * 30 days = 1.5 TB). If we get close to 100 customers, the total should be about 150 TB of data per month. Initially we ingest the data as is, run it through some ETL processes, and store some kind of summarized data in a NoSQL database; after this process the original data must be deleted.
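
To double-check the sizing, here is a quick back-of-the-envelope sketch in Python. It uses only the figures from this post (100 million records per day, 500 bytes worst case, 30 days, 100 customers) and assumes decimal units (1 GB = 10^9 bytes). The function and constant names are just made up for illustration, and it ignores replication, compression, and indexes, so treat it as a rough estimate of raw ingest only.

```python
# Rough raw-ingest estimate from the figures in the post.
# Assumptions: decimal units, worst-case record size, no replication or compression.
RECORDS_PER_DAY = 100_000_000  # per customer
BYTES_PER_RECORD = 500         # worst case
DAYS_PER_MONTH = 30
CUSTOMERS = 100

GB = 10**9
TB = 10**12


def monthly_volume_bytes(records_per_day, bytes_per_record, days, customers=1):
    """Raw bytes ingested over `days` for the given number of customers."""
    return records_per_day * bytes_per_record * days * customers


daily_bytes = RECORDS_PER_DAY * BYTES_PER_RECORD
per_customer_bytes = monthly_volume_bytes(
    RECORDS_PER_DAY, BYTES_PER_RECORD, DAYS_PER_MONTH
)
total_bytes = monthly_volume_bytes(
    RECORDS_PER_DAY, BYTES_PER_RECORD, DAYS_PER_MONTH, CUSTOMERS
)

print(f"Per customer per day:   {daily_bytes / GB:.0f} GB")
print(f"Per customer per month: {per_customer_bytes / TB:.1f} TB")
print(f"All customers per month: {total_bytes / TB:.0f} TB")
```

Since the raw data is deleted after the ETL step, this would be closer to an upper bound on what passes through the cluster in a month than to what has to be kept long term.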

Then the stored data will be mined to create relationships in a graph database and to populate a relational database (such as Postgres).


I've been thinking about several possible solutions. Some of them are:


- HBase + Hive/Impala

- HBase + Spark

- HBase + MapReduce (pretty much the same as HBase + Hive but more complex)

- Spark standalone + S3 + NoSQL database (no Hadoop)


If you can think of something better or more performant, please let me know.




