Could someone help me set up a distributed Hadoop cluster?
What should the size of the DataNodes (DN) and NameNode (NN) be? How is the size calculated?
How do I calculate the size of the edge node, DataNode, and NameNode?
How do I calculate the number of edge nodes and DataNodes?
What considerations are needed for DR (disaster recovery)?
Which areas do an architect and an admin have to concentrate on to set up a cluster for production?
I know the questions I've asked only have brief answers here, but if someone could help me get started it would be helpful.
Check out the documentation. http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/index.html
How many nodes? What hardware do you have? How much data will you have?
What operating system?
What are your use cases?
How large are the boxes? Number of CPUs? RAM? Disk size?
128-256 GB of RAM per server is a sweet spot.
For disaster recovery: can you afford a full copy of your production site with a fast connection between the two? Do you want to do dual ingest, or synchronize with Falcon later? Or are hourly synchronizations okay?
You have to set up HDFS snapshots.
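Snapshots are enabled per directory by an HDFS admin, then taken on demand. A minimal sketch, using a hypothetical /data/warehouse path and snapshot name:

```shell
# Allow snapshots on a directory (admin operation); the path is hypothetical.
hdfs dfsadmin -allowSnapshot /data/warehouse

# Take a snapshot with an explicit name
hdfs dfs -createSnapshot /data/warehouse backup-before-upgrade

# Snapshots appear under the read-only .snapshot subdirectory
hdfs dfs -ls /data/warehouse/.snapshot
```

Restoring a file is then just a copy out of the .snapshot directory back into the live tree.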
How many users will query your data?
It's really hard to give general guidelines for specific needs.
There are so many questions and options; if this is for a real cluster, you should sit with an experienced Hadoop person for a few weeks.
@Timothy Spann Actually yes. We are planning to implement this in production, but I just wanted to know whether there is a basic/general capacity configuration to start with. Later, when the data explodes, we can extend the number of nodes and other things, and I feel that will be easier. But to start we need a balanced cluster that can handle decent loads, so we can gain confidence before extending further. I also wanted to know what calculations have to be done for capacity planning.
Good Heap Size Guide
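A commonly cited rule of thumb for NameNode heap is roughly 1 GB of heap per million HDFS objects (files + blocks), plus headroom for GC and growth. The figures below are hypothetical placeholders; substitute your own estimates:

```shell
# Hypothetical estimate - replace with your own expected object count.
objects_millions=50                       # expected files + blocks, in millions

# Rule of thumb: ~1 GB of NameNode heap per million objects
heap_gb=$(( objects_millions * 1 ))

# Add ~20% headroom for GC pressure and growth
heap_with_headroom_gb=$(( heap_gb * 120 / 100 ))

echo "Suggested NameNode heap: ${heap_with_headroom_gb} GB"
```

This is only a first-order sizing; the Hortonworks docs linked above have more detailed per-component heap guidance.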
Capacity planning depends on multiple factors:
- Amount of data that needs to be stored, the incremental data growth rate for the next 2-3 years, and the data retention period
- Kind of processing: real-time or batch
- Use cases: based on the use case, workload patterns can be derived: balanced workload (load equally distributed across CPU, disk I/O, and network I/O), compute-intensive (CPU bound, involving complex algorithms, NLP, HPC, etc.), or I/O-intensive (jobs requiring very little compute power and mostly I/O, e.g. archival use cases with lots of cold data).
If you don't know the workload pattern, it is recommended to start with a balanced workload.
- The SLAs for the system.
You need to consider all the above factors. It really depends on what you want to do with the data, and you need to look at each and every piece: for example, ingesting data from different sources (Sqoop, Flume, NiFi, etc.), transformations if any (Pig/Hive), consumption (Hive), or real-time processing with a distributed message queue (Kafka) and storage (HBase).
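The storage side of the factors above can be turned into a rough DataNode count: project the data growth, multiply by the replication factor plus intermediate/temp overhead, and divide by usable disk per node. A minimal sketch with hypothetical figures (replace every value with your own estimates):

```shell
# Hypothetical sizing inputs - substitute your own numbers.
data_tb=100          # current data volume, TB
growth_pct=25        # annual growth rate, percent
years=3              # planning horizon
replication=3        # HDFS replication factor
overhead_pct=25      # temp/intermediate space as a percent of data
node_disk_tb=48      # usable disk per DataNode, TB

# Project data growth over the horizon (integer TB, truncating)
future_tb=$data_tb
for _ in $(seq "$years"); do
  future_tb=$(( future_tb * (100 + growth_pct) / 100 ))
done

# Raw capacity = projected data x replication x (1 + overhead)
raw_tb=$(( future_tb * replication * (100 + overhead_pct) / 100 ))

# DataNode count, rounded up
datanodes=$(( (raw_tb + node_disk_tb - 1) / node_disk_tb ))

echo "Projected data: ${future_tb} TB, raw capacity: ${raw_tb} TB, DataNodes: ${datanodes}"
```

This only sizes storage; compute-intensive workloads may need more nodes than the disk math alone suggests.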
The best strategy, in my opinion, is to set up a development cluster and test it out, then scale it up. Hadoop is designed so that most tasks scale linearly with the resources allocated to them.
The two links below are a good starting point: