
Best Practices for 3-5 Node cluster HDP and HDF


We are looking to stand up a 3-5 node cluster in AWS for a mix of NiFi, Spark, and Hive work, and would like best practices on cluster installation and hardware sizing. The workload will be as follows:

1) NiFi will be used to stage files from a local file system into HDFS. The files are approximately 1 GB in size; this job will run once a day and stage a single 1 GB file.
2) Hive will be used to load the file from step 1) into a Hive table. The resulting table will contain approximately 10 million records.
3) Ad hoc querying will be performed against the table from step 2) (in Zeppelin), where simple aggregations will be done. The result set will be joined onto two other tables (one with 1,400 records and the other with 4,000).
4) Spark will be used to apply an ML model (possibly a Random Forest regressor) to the file from step 1). The data set contains approximately 10 million records, each with approximately 10 categorical variables.
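One practical note on step 4: tree-based learners in Spark MLlib expect numeric feature vectors, so the ~10 categorical variables will need to be indexed or one-hot encoded first (in Spark, typically via StringIndexer/OneHotEncoder). As a framework-neutral illustration of what that transformation does, here is a minimal pure-Python sketch; the column names and values are hypothetical, purely for illustration:

```python
# Minimal sketch of one-hot encoding categorical variables before tree training.
# Column names ("region", "product") and values below are hypothetical examples.

def one_hot_encode(records, categorical_keys):
    """Turn each record's categorical values into 0/1 indicator features."""
    # Collect the distinct values observed for every categorical column.
    values = {k: sorted({r[k] for r in records}) for k in categorical_keys}
    encoded = []
    for r in records:
        row = []
        for k in categorical_keys:
            # One indicator column per distinct value of this categorical.
            row.extend(1 if r[k] == v else 0 for v in values[k])
        encoded.append(row)
    return encoded, values

# Hypothetical two-record sample with two categorical variables.
sample = [
    {"region": "us-east", "product": "A"},
    {"region": "us-west", "product": "B"},
]
features, mapping = one_hot_encode(sample, ["region", "product"])
# Each record becomes 4 indicator features (2 regions + 2 products).
```

With ~10 categoricals over 10 million rows the resulting feature width depends on cardinality, which in turn influences executor memory sizing for the Random Forest job.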

We would like guidance on best practices for deployment and hardware sizing.

Any explanation of the rationale for the suggested setup would be appreciated.