I plan to set up a Hadoop ecosystem that consumes data from an RDBMS and stores it in HBase, performs transformation and data cleaning in HBase, loads that data into a Hive warehouse, and then consumes the warehouse data for ML (Spark, Flink and Scala). Which of the following is the best practice for setting this up?
1. A single cluster with one master and 5 to 10 slaves, each with as much RAM and disk space as possible.
2. Three clusters (3 masters with their respective slaves) for HBase, Hive and ML separately.
The data volume will grow every year, from millions to trillions of records, so I need to know the best practice for the above requirements (1 or 2).
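For context, the RDBMS-to-HBase ingestion step I have in mind would be a Sqoop import along these lines (the connection string, credentials, and table/column names below are placeholders, not my actual setup):

```
sqoop import \
  --connect jdbc:mysql://rdbms-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --hbase-table orders_raw \
  --column-family cf \
  --hbase-row-key order_id \
  -m 4
```

Each of the 4 mappers pulls a slice of the source table in parallel, so the `-m` setting is one of the knobs that controls how much load lands on the RDBMS.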
This is a fairly generic question, so a precise answer is difficult. But given that you want to get the data from an RDBMS, I think you can go with one cluster. What you need to consider is how much throughput you will have to handle; typically the RDBMS will limit this anyway.
Just for consideration: Spark and Flink are mainly RAM intensive, while HBase uses HDD and RAM, depending on the load. Hive again uses mainly HDD for MapReduce. But if you plan to create an external Hive table pointing to HBase, you are back in the HBase usage pattern. Assuming you have sufficient RAM available on your nodes (I would go with >= 2 GB per CPU core), I think one cluster for all would do.
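The external Hive table over HBase mentioned above can be sketched like this, using Hive's HBase storage handler (the table and column names here are illustrative):

```
-- Hive table backed by an existing HBase table; queries against it
-- read from HBase, so they follow the HBase usage pattern.
CREATE EXTERNAL TABLE orders_raw (
  order_id STRING,
  amount   DOUBLE,
  status   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:amount,cf:status"
)
TBLPROPERTIES ("hbase.table.name" = "orders_raw");
```

`:key` maps the HBase row key to `order_id`, and the remaining entries map column-family:qualifier pairs to Hive columns.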
If the load increases later, you can scale out the cluster, which is one of the big advantages of Hadoop.