I have just started working on big data, so my question might be very basic for some of you, but please bear with me. I am using an Azure machine with a pre-configured HDP Sandbox. The specs of the machine are:
RAM: 64 GB
Hard Disk: 2 TB + 1 TB
Problem: I have a big data file of around 600 GB, and I have been able to store the data in a Hive table as a text file. Now my goal is to create an ORC-formatted table from that table, so that my queries run faster. But when I insert the data into the ORC table from the table stored as textfile, the query never completes. I have also noticed that the memory allocated for all YARN containers is 3000 MB out of 62.9 GB, so I tried to increase the YARN container size from the Ambari dashboard and then ran the query, but every task in the query failed. Maybe I have to change other dependent parameters too, but I don't know which ones.
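For reference, this is roughly what I am running (the table and column names here are placeholders; my real schema differs):

```sql
-- Hypothetical names: "logs_text" stands in for my existing text-file table.
CREATE TABLE logs_orc (
  id BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- This is the INSERT that never completes:
INSERT OVERWRITE TABLE logs_orc
SELECT * FROM logs_text;
```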
Therefore, can anyone suggest how I should increase the YARN container and MapReduce memory sizes so that I make maximum use of the machine and the queries run faster and complete successfully?
Also, is 600 GB really "big data" for a machine with these specs?
How do I make sure that the queries do not fail due to vertex failures, etc.?
I believe your biggest problem is that you are trying to use the HDP Sandbox for something of any decent size. That environment wasn't necessarily built for you to run hundreds of GB of data (which itself is surely not all that "big" as Big Data goes). The Sandbox also has a bunch of configuration settings focused on running a pseudo-cluster (all on one box), which is NOT ideal for any job of real size.
You did go down the right path of changing the max amount of memory that YARN can use, but at the end of the day your box only has two CPUs, so you really can't run that many containers anyway. You'd probably also need to change the size of the Tez containers for Hive/Tez to ask for more than whatever the tiny Sandbox configuration is granting you.
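As a sketch, these are the knobs you'd likely need to bump together. The values below are illustrative only, not tuned recommendations, and assume a single 64 GB / 2-CPU box:

```sql
-- In Ambari: YARN -> Configs (cluster-wide, requires restart)
--   yarn.nodemanager.resource.memory-mb  = 49152   -- let YARN use ~48 GB
--   yarn.scheduler.maximum-allocation-mb = 49152
--   yarn.scheduler.minimum-allocation-mb = 2048

-- In Ambari: Hive -> Configs, or per-session in the Hive shell:
SET hive.tez.container.size=8192;    -- MB per Tez container
SET tez.am.resource.memory.mb=4096;  -- Tez ApplicationMaster memory
```

Whatever values you pick, keep the Tez container size at or below the YARN maximum allocation, or the containers will never be granted.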
I don't know the pricing model, but I'm betting four boxes with 16 GB each would be cheaper than the 64 GB one you are using now, and that would allow you to spread the workload across multiple machines (and yes, you'd have to install HDP via Ambari, but the http://docs.hortonworks.com site can help a LOT).
Good luck and happy Hadooping!
So I cannot process this data on one machine? Is the final answer that I will have to set up multiple machines for this task?