We are currently involved in a project where we intend to use Hadoop to process and store a huge amount of data, and Spark for real-time analytics. At this moment, we are wondering which architecture would suit our case best, from these two possibilities:
A) A single multifunctional cluster running HBase, HDFS (standard configuration), and Spark, with 3 masters and tens of slaves.
B) Three specialized clusters optimized for:
- Cluster 1, specialized for storage: HDFS configured to handle small files. This cluster will process very small files and will store every file in HBase.
- Cluster 2, specialized for batch processing: will process huge files of structured data, with HDFS configured to deal with these huge files.
- Cluster 3, with Spark, oriented to real-time processing of streaming data.
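One concrete factor behind Cluster 1's small-files concern is NameNode memory: HDFS keeps metadata for every file and block in NameNode heap, commonly estimated at roughly 150 bytes per object (a widely cited rule of thumb, not an exact figure). A minimal sketch of that back-of-the-envelope estimate, with illustrative file counts as assumptions:

```python
# Hedged rule of thumb: each file or block object costs ~150 bytes
# of NameNode heap; the exact cost varies by Hadoop version and config.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by metadata alone.

    Each file contributes one file object plus one object per block.
    """
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# Illustrative scenario: 100 million small files, each fitting in one block.
heap = namenode_heap_bytes(100_000_000)
print(f"{heap / 1024**3:.1f} GiB")  # roughly 28 GiB of heap for metadata
```

This is why small files are often packed into HBase (as in Cluster 1) or into container formats such as SequenceFiles/HFiles rather than stored one-per-HDFS-file.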
What would be the most important criteria to determine the best option in our case?