How do I limit the number of simultaneous tasks for a single Tez job?
Labels: Apache Tez
Created 12-15-2015 09:35 PM
I'm using the Elasticsearch Hadoop connector to push data via Pig to Elasticsearch. This process worked very well on my 6-node cluster. I now have a 12-node cluster running HDP 2.3, and Pig is now pushing data faster than Elasticsearch can keep up with.
My cluster runs 134 tasks at once for this Pig job. Is there an easy way to change the number of simultaneous tasks for a Pig/Tez job? I changed my queue configuration to limit one queue to 50% of the cluster's resources, but I'm still overloading Elasticsearch.
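For reference, the queue change was along these lines in capacity-scheduler.xml (a sketch; the queue name "etl" is just an example, yours will differ):
yarn.scheduler.capacity.root.etl.capacity=50
yarn.scheduler.capacity.root.etl.maximum-capacity=50
Even with the queue capped at 50%, the job still gets enough containers to overwhelm Elasticsearch.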
Created 12-15-2015 11:02 PM
Have you tried setting mapreduce.jobtracker.maxtasks.perjob for your Pig application?
Alternatively, you can use YARN node labels to run your Pig job on specific nodes.
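For example, in your Pig script (a sketch; 100 is an illustrative value, and note this is a JobTracker-era property, so it may not be honored under YARN):
set mapreduce.jobtracker.maxtasks.perjob '100';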
Created 12-21-2015 03:48 PM
In my search for ways to reduce parallel tasks, I hadn't yet come across that property. I'll give it a try. Thank you!
Created 12-22-2015 12:54 PM
I would switch the engine back to MapReduce to make your life easier, and then control the number of mappers spawned by controlling the input split sizes, where N is the minimum number of bytes you want a mapper to process and X is the maximum. This is easier than trying to understand how the Tez waves setting works, because that involves the current capacity of the queue and not just the size of the underlying data.
# start Pig with the MapReduce engine
pig -x mr
-- in your Pig script (Pig's SET syntax takes no '='; values are in bytes)
set mapreduce.input.fileinputformat.split.minsize 'N';
set mapreduce.input.fileinputformat.split.maxsize 'X';
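To make that concrete (illustrative numbers, assuming a 128 MB block size): roughly 17 GB of input splits into about 134 map tasks, which matches the concurrency you're seeing. Raising the minimum split size to 512 MB would group that same input into roughly 34 mappers:
set mapreduce.input.fileinputformat.split.minsize '536870912';  -- 512 MB
set mapreduce.input.fileinputformat.split.maxsize '1073741824'; -- 1 GB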
Created 12-22-2015 10:27 PM
I think this is probably the better approach. We initially went with Tez for its better performance over MapReduce, but since our cluster is easily overwhelming Elasticsearch, it seems reasonable to revert to MapReduce and tune settings that are easier to control.
Thank you.
