I am a newbie to the big data paradigm. I have been working with the sandbox for some time, and to start with I have installed a standalone single-node HDP 2.5 on a 4-core, 8 GB machine. Please help me understand two things:
1. How is memory (both RAM and disk storage) used by the different HDP components such as YARN, HDFS, etc.? I am not able to process even a 50 MB file stored in HDFS with Spark, and my YARN memory is almost full.
2. How should I plan the scaling of my single node into a cluster to handle GBs of data? I am planning to use Spark on top of YARN.
What are your requirements, and how do you plan to scale? Right now it seems you just have a laptop. Is this a VM? How many disks are there? I am guessing one. That being said, you should be able to process a single 50 MB file; in this case it's only one block. You mention your YARN memory is "almost full". How much memory have you assigned to YARN? When you run a job on a 50 MB file, you can't possibly exhaust 8 GB of memory (with a few extremely rare exceptions).
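To see why an 8 GB single node fills up so quickly, it helps to do the back-of-the-envelope arithmetic: on a one-box install, every Hadoop daemon competes with YARN containers for the same RAM. The figures below are illustrative assumptions for a typical single-node HDP layout, not HDP-mandated values:

```python
# Rough memory budget for a single-node HDP install on an 8 GB machine.
# All per-component figures below are assumptions for illustration.

TOTAL_RAM_GB = 8.0
os_reserve = 1.0       # headroom for the operating system
hadoop_daemons = 2.5   # NameNode, DataNode, ResourceManager, NodeManager,
                       # Ambari, etc. all run on this one node
metrics_etc = 0.5      # monitoring and miscellaneous overhead (assumed)

yarn_container_budget = TOTAL_RAM_GB - os_reserve - hadoop_daemons - metrics_etc

# With a 1 GB minimum container size, that budget fits only a handful of
# containers, and a Spark driver (ApplicationMaster) plus one executor
# already consumes two of them.
min_container_gb = 1.0
max_containers = int(yarn_container_budget // min_container_gb)

print(f"YARN container budget: {yarn_container_budget:.1f} GB")
print(f"Max concurrent 1 GB containers: {max_containers}")
```

So even though the file is only 50 MB, the YARN memory gauge can legitimately sit near full just from the standing allocations.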
Check your Spark settings to allow spilling to disk. Check the following link.
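On a node this small, the main thing is keeping the driver and executor requests within what YARN can actually grant. A minimal sketch of `spark-defaults.conf` for a 4-core / 8 GB single node might look like the following; the specific sizes are assumptions to tune, not recommended values, and note that recent Spark versions spill shuffle data to disk automatically:

```properties
# spark-defaults.conf -- illustrative values for a 4-core / 8 GB single node
spark.master                          yarn
spark.driver.memory                   1g
spark.executor.memory                 1g
spark.executor.cores                  1
spark.yarn.executor.memoryOverhead    384
# Fraction of heap usable for execution and storage (Spark's default is 0.6)
spark.memory.fraction                 0.6
```

Each container YARN grants must cover executor memory plus the overhead, so 1g + 384m per executor here.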
As for scaling your nodes, you need to have a baseline. If this is for personal home use to learn Hadoop, just keep practicing on one node and you'll learn quite a bit over time. If this is for your work, then you need to start with at least 5 nodes, each node having multiple disks and a reasonable amount of memory, like 16 GB to start with, depending on your use cases.
Hadoop scales linearly up to hundreds of nodes (I am not talking about thousand-node clusters), which can easily support petabyte-scale data. So start with a small 5-node cluster and create a baseline; if you need to scale, go up to 8-10 nodes, chart all your numbers, and assume linear scalability up to easily 500 nodes.
A typical Hadoop server has 12 x 2 TB disks, 128-256 GB of memory, and anywhere between 16 and 32 cores (not counting hyper-threading).
Data that you store should also typically be compressed (I prefer Snappy, as it offers a good balance between compression ratio and compression/decompression speed).
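For Spark specifically, compression is mostly a matter of configuration. A minimal sketch, assuming Spark writing Parquet (and noting that Snappy is already the default Parquet codec in recent Spark versions):

```properties
# spark-defaults.conf -- make Snappy explicit for stored data and shuffle I/O
spark.sql.parquet.compression.codec   snappy
spark.io.compression.codec            snappy
```

Snappy-compressed Parquet also stays splittable, which matters once files span multiple HDFS blocks.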
Thanks @Kuldeep Kulkarni these links are really helpful.
@mqureshi thanks for a descriptive answer. I am currently using a VM of the above-mentioned config. I tried changing the allocated memory to the YARN container maximum size. It reduced the YARN usage, but it is still at 86%. As you pointed out, I also think I should start with a single-node HDP environment.
So what should my system config be if I want to process a 1 GB file in Spark on a single node?