
How do I limit the number of simultaneous tasks for a single Tez job?


New Contributor

I'm using the Elasticsearch Hadoop connector to push data via Pig to Elasticsearch. This process worked very well on my 6-node cluster. I now have a 12-node cluster running HDP 2.3, and it seems Pig is now pushing more data than Elasticsearch can keep up with.

My cluster runs 134 simultaneous tasks for this Pig job. Is there an easy way to change the number of simultaneous tasks for a Pig/Tez job? I changed my queue configuration to limit one queue to 50% of cluster resources, but I'm still overloading Elasticsearch.

1 ACCEPTED SOLUTION


Re: How do I limit the number of simultaneous tasks for a single Tez job?

Contributor

I would switch the engine back to MapReduce to make your life easier, and then control the number of mappers spawned by controlling the input split sizes, where N is the minimum number of bytes you want a mapper to process and X is the maximum. This is easier than trying to understand how the Tez waves setting works, because that involves the current capacity of a queue and not just the size of the underlying data.

#start pig with

pig -x mr

#in your pig script (Pig's SET statement takes "property value;" with no equals sign)

set mapreduce.input.fileinputformat.split.minsize N;

set mapreduce.input.fileinputformat.split.maxsize X;
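To see how the split maximum translates into concurrency: each mapper processes at most one split, so the number of map tasks is roughly the ceiling of the total input size divided by X. A small arithmetic sketch (the byte counts below are made-up illustrations, not numbers from this thread):

```shell
# Illustration with assumed numbers (not from this thread):
# ~14.4 GB of input and a 256 MB maximum split size.
input_bytes=14400000000
max_split=268435456          # 256 MB, a hypothetical value for X

# Each mapper takes at most one split, so the map-task count is the
# ceiling of input_bytes / max_split.
mappers=$(( (input_bytes + max_split - 1) / max_split ))
echo "$mappers"   # prints 54
```

Raising X shrinks the task count (fewer, bigger splits), which is the lever for easing the load on Elasticsearch.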

4 REPLIES

Re: How do I limit the number of simultaneous tasks for a single Tez job?

Rising Star

Have you tried setting mapreduce.jobtracker.maxtasks.perjob for your Pig application?

Alternatively, you can use node labels to run your Pig job on specific nodes.
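If you try the property route, one way is to pass it on the Pig command line so the script itself stays unchanged. This is only a sketch: the cap of 50 is an arbitrary illustration, myscript.pig is a placeholder name, and the property applies to the MapReduce engine, so Pig would need to be started with -x mr:

```shell
# Sketch only: 50 is an arbitrary cap and myscript.pig is a placeholder.
# -D properties must come before the other Pig options.
pig -Dmapreduce.jobtracker.maxtasks.perjob=50 -x mr myscript.pig
```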

Re: How do I limit the number of simultaneous tasks for a single Tez job?

New Contributor

In my searching for ways to reduce parallel tasks, I had not yet seen that. I will give it a try. Thank you!


Re: How do I limit the number of simultaneous tasks for a single Tez job?

New Contributor

I think this is probably the better approach. We initially chose Tez for its better performance over M/R, but with our cluster so easily overwhelming Elasticsearch, it seems reasonable to revert to M/R and tweak settings that are easier to control.

Thank you.