How do I limit the number of simultaneous tasks for a single Tez job?

Contributor

I'm using the Elasticsearch Hadoop connector to push data via Pig to Elasticsearch. This process worked very well on my 6-node cluster. I've now moved to a 12-node cluster running HDP 2.3, and it seems Pig is now pushing data faster than Elasticsearch can keep up with.

My cluster is running 134 tasks at once for this Pig job. Is there an easy way to change the number of simultaneous tasks for a Pig/Tez job? I changed my queue configuration to limit one queue to 50% of cluster resources, and I'm still overloading Elasticsearch.
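For reference, the 50% cap was set in capacity-scheduler.xml (the queue name below is just a placeholder). As I understand it, this caps the queue's share of cluster resources rather than imposing a hard task count, which may be why Elasticsearch still gets swamped:

#in capacity-scheduler.xml (queue name is a placeholder)

yarn.scheduler.capacity.root.myqueue.capacity=50

yarn.scheduler.capacity.root.myqueue.maximum-capacity=50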

1 ACCEPTED SOLUTION

Rising Star

I would switch the engine back to MapReduce to make your life easier, and then control the number of Mappers spawned by setting the input split sizes, where N is the minimum number of bytes you want a Mapper to process and X is the maximum. This is easier than trying to understand how the Tez waves setting works, because that depends on the current capacity of a queue and not just the size of the underlying data.

#start pig with

pig -x mr

#in your pig script (Pig's set statement takes no equals sign)

set mapreduce.input.fileinputformat.split.minsize N;

set mapreduce.input.fileinputformat.split.maxsize X;
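As a rough sanity check (assuming the default 128 MB split size), 134 Mappers implies roughly 17 GB of input. Raising the minimum split size to 512 MB would shrink the job to around 34 Mappers total, so no more than that can run at once. The values below are just an illustration; sizes are in bytes:

#example: ~17 GB input / 512 MB splits ≈ 34 mappers

set mapreduce.input.fileinputformat.split.minsize 536870912;

set mapreduce.input.fileinputformat.split.maxsize 1073741824;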

4 REPLIES

Expert Contributor

Have you tried setting mapreduce.jobtracker.maxtasks.perjob for your Pig application?
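If you try it, it can be set like any other Hadoop property in the script. Keep in mind it's a JobTracker-era property, so it's worth verifying that YARN/Tez actually honors it; the cap below is just an example value:

#in your pig script

set mapreduce.jobtracker.maxtasks.perjob 50;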

Alternatively, you can use node labels to run your Pig job on a specific subset of nodes.
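Roughly, you'd define a label with the YARN admin CLI, tag the nodes, and make a queue accessible to that label; the label, host, and queue names below are made up for illustration:

#define a label and tag nodes with it

yarn rmadmin -addToClusterNodeLabels "espush"

yarn rmadmin -replaceLabelsOnNode "node1.example.com=espush node2.example.com=espush"

#in capacity-scheduler.xml, let a queue use the label

yarn.scheduler.capacity.root.esqueue.accessible-node-labels=espush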

Contributor

In my search for ways to reduce parallel tasks, I hadn't come across that one yet. I will give it a try. Thank you!

Contributor

I think switching back to MapReduce and controlling the split sizes is probably the better approach. We were initially using Tez for the performance gains over M/R, but since our cluster easily overwhelms Elasticsearch, it seems reasonable to revert to M/R and tune settings that are easier to control.

Thank you.