I have just started working on big data, so my question might be very basic for some of you, but please bear with me. I am using an Azure machine with a pre-configured HDP sandbox. The specs of the machine are: RAM: 64 GB; hard disk: 2 TB + 1 TB; CPU(s): 2.

Problem: I have a large data file (around 600 GB) that I have managed to load into a Hive table stored as a text file. My goal now is to create an ORC-formatted table from that table, so that my queries run faster. However, when I insert the data from the textfile table into the ORC table, the query never completes.

I also noticed that the memory allocated for all YARN containers is only 3000 MB out of 62.9 GB, so I tried to increase the YARN container size from the Ambari dashboard, but after that every task in the query failed. I probably need to change other dependent parameters too, but I don't know which ones.

Can anyone suggest how to increase the YARN container and MapReduce memory sizes so that the machine is used to its full capacity and queries complete successfully? Also, is 600 GB really "big data" for a machine with these specs? And how do I make sure that queries do not fail due to vertex failures, etc.?
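For reference, the conversion I am attempting looks roughly like the sketch below. The table names `mydata_text` and `mydata_orc` are placeholders for my actual tables, and the SET values are illustrative starting points I have seen suggested, not settings I have verified on this sandbox:

```sql
-- Per-session memory hints for the conversion query
-- (illustrative values; must fit within the cluster's YARN limits):
SET hive.execution.engine=tez;
SET hive.tez.container.size=4096;    -- MB; should not exceed yarn.scheduler.maximum-allocation-mb
SET tez.am.resource.memory.mb=4096;  -- memory for the Tez application master

-- Create the ORC copy of the textfile table:
CREATE TABLE mydata_orc
STORED AS ORC
AS SELECT * FROM mydata_text;
```

On the YARN side, I understand the related cluster-level properties in Ambari are `yarn.nodemanager.resource.memory-mb` (total memory YARN may allocate on the node), `yarn.scheduler.minimum-allocation-mb` / `yarn.scheduler.maximum-allocation-mb` (per-container bounds), and `mapreduce.map.memory.mb` / `mapreduce.reduce.memory.mb` for MapReduce jobs, but I am not sure which combinations of values are consistent with each other.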