For a greenfield Hadoop implementation, what should be considered during capacity planning? We have SLAs to meet for the many use cases that need Hadoop solutions, but we have granular processing information for only one use case, on which we plan to base our capacity planning analysis.
For example, assume a job takes 40 GB of input data, performs data cleansing and deduplication, and then runs multiple joins, groupings, aggregations, etc. Once processing is complete, the data is written to HDFS for further consumption by OLAP systems.
If the present SLA for this job's data processing is X hours and is being met, we want to use it as a reference point to derive the hardware requirements needed to meet the target SLAs for all the other use cases.
What is the best strategy for this kind of capacity analysis to arrive at reasonable sizing estimates? Also, which factors should be taken into account while doing capacity planning?
The best strategy, in my opinion, is to set up a development cluster, test your workload there, and then scale up. Hadoop is designed so that most tasks scale roughly linearly with the resources thrown at them.
For example, I was able to load 300 GB of raw delimited data, apply some data masking, and write the output to ORC files in around 30 minutes (on 7 large datanodes).
Aggregation queries and joins against some lookup tables for 4 use cases (4 pre-aggregated tables) then took an additional 10-15 minutes.
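As a rough illustration of how a test run like that translates into per-node throughput (the figures below are the ones quoted above; your workload will differ):

```python
# Back-of-envelope throughput from the measured run described above.
# All figures are illustrative, taken from the example in this answer.
input_gb = 300     # raw delimited data loaded
load_minutes = 30  # observed wall-clock time for load + masking + ORC write
datanodes = 7      # cluster size used for the test

cluster_throughput = input_gb / load_minutes           # GB/min for the whole cluster
per_node_throughput = cluster_throughput / datanodes   # GB/min per datanode

print(f"cluster:  {cluster_throughput:.1f} GB/min")
print(f"per node: {per_node_throughput:.2f} GB/min")
```

A number like this per-node rate is what you carry forward when sizing the target cluster.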
But it depends a lot on the kind of processing you want to do, the connection to the source systems, etc. So testing on a development system and then scaling the target system up based on those results seems like the best approach to me.
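Assuming roughly linear scaling (which holds for many Hadoop workloads but should be verified against your own test runs), the scale-up step can be sketched as a simple estimate. The function name, headroom factor, and example numbers below are all hypothetical:

```python
import math

def estimate_nodes(data_gb, sla_minutes, per_node_gb_per_min, headroom=1.3):
    """Estimate the datanode count needed to process data_gb within
    sla_minutes, given a per-node throughput measured on a dev cluster.
    headroom is a safety factor for skew, failures, and growth."""
    required_throughput = data_gb / sla_minutes        # GB/min needed overall
    nodes = required_throughput / per_node_gb_per_min  # nodes at measured rate
    return math.ceil(nodes * headroom)                 # round up with headroom

# Hypothetical target: 1 TB within a 2-hour SLA at ~1.4 GB/min per node
print(estimate_nodes(data_gb=1000, sla_minutes=120, per_node_gb_per_min=1.4))
```

Treat the result as a starting point, not a final answer: rerun the measurement whenever the processing logic or data profile changes materially.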
As for things to look out for: that really depends on what you want to do with the data. The pipeline needs to work end to end, after all, so you need to look at every single piece. For example: Sqoop (are there bottlenecks in the connection to the source database?), transformations (Pig/Hive), and consumption (Hive?).