@Eugene Geis Thank you for your detailed description of your issues. There is a single overarching theme to my answer: your cluster is not properly sized for the processing you are doing. Big Data on Hadoop leverages horizontal scaling, and like all data processing it can hit resource constraints under a given implementation.

I had a similar situation the first time I worked on Hadoop. I had 2.53 billion records, each with 57 columns, that I bulk loaded to HBase. I was on an 8-node cluster, and the first bulk load brought ZooKeeper, HBase, and the cluster to their knees and then to a groaning death. The ultimate root cause was that the number of ZooKeeper connections was configured far too low for the extreme workload I threw at it. I had to reconfigure those and then bulk load in separate chunks rather than in one shot. Things were still not ideal, because HBase major compaction ran for hours afterwards, stressing CPU and memory on all of the nodes. I eventually resized the cluster (added more nodes) to accommodate the load I was throwing at it.

To answer your question: you are throwing too much load at your cluster, given its size. Hadoop is famously robust, but only when properly sized.

Regarding your local directories filling up: Pig runs MapReduce jobs under the covers, and intermediate (temporary) data is written to local disk between the map and reduce steps. The large amount of intermediate data produced by your triple join of a TB of data is spread among so few nodes in your cluster that it exceeds the capacity on some of them.

My suggestion is to start with lower loads on your given cluster and learn how to optimize your jobs. For example, one common optimization is to compress your intermediate data (see the sketch at the end of this reply). See this link on optimizing Pig: https://community.hortonworks.com/questions/57449/fine-tune-the-pig-job.html#comment-58059

The next suggestion (after learning to optimize) is to add more data nodes to your cluster to horizontally scale the load. You could simply add nodes and not optimize, but we always want to optimize to use resources more wisely. See this link for help on sizing your cluster: http://info.hortonworks.com/SizingGuide.html
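To make the compression suggestion concrete, here is a minimal sketch of the properties I would try at the top of a Pig script. These are the standard Pig and MapReduce settings for compressing temporary data and map output; the specific codec choices (gzip for Pig temp files, Snappy for map output) are assumptions on my part, so adjust them to whatever codecs are actually installed on your cluster:

-- Compress Pig's intermediate (temp) files written between chained MR jobs
SET pig.tmpfilecompression 'true';
SET pig.tmpfilecompression.codec 'gz';    -- 'gz' or 'lzo'; pick what your nodes have installed

-- Compress map output spilled to local disk between the map and reduce steps
SET mapreduce.map.output.compress 'true';
SET mapreduce.map.output.compress.codec 'org.apache.hadoop.io.compress.SnappyCodec';

-- ... your LOAD / JOIN / STORE statements follow as before ...

The mapreduce.* properties can also be set cluster-wide in mapred-site.xml (via Ambari) instead of per script, if you want every job to benefit.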