How do I limit the number of simultaneous tasks for a single Tez job?
Labels: Apache Tez
Created 12-15-2015 09:35 PM
I'm using the Elasticsearch Hadoop connector to push data via Pig to Elasticsearch. This process worked very well on my 6-node cluster. I now have a 12-node cluster running HDP 2.3, and Pig is now pushing data faster than Elasticsearch can keep up with.
My cluster runs 134 tasks at once for this Pig job. Is there an easy way to change the number of simultaneous tasks for a Pig/Tez job? I changed my queue configuration to limit one queue to 50% of the cluster's resources, but I'm still overloading Elasticsearch.
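For reference, the queue change was along these lines in capacity-scheduler.xml (a sketch; the queue name "etl" is just an example, yours will differ):
yarn.scheduler.capacity.root.etl.capacity=50
yarn.scheduler.capacity.root.etl.maximum-capacity=50
Even with the queue capped at 50%, the job still gets enough containers to overwhelm Elasticsearch.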
Created 12-15-2015 11:02 PM
Have you tried setting mapreduce.jobtracker.maxtasks.perjob for your Pig application?
Alternatively, you can use YARN node labels to run your Pig job on specific nodes.
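For example, in your Pig script (a sketch; 100 is an illustrative value, and note this is a JobTracker-era property, so it may not be honored under YARN):
set mapreduce.jobtracker.maxtasks.perjob '100';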
Created 12-21-2015 03:48 PM
In my search for ways to reduce parallel tasks, I hadn't yet come across that property. I'll give it a try. Thank you!
Created 12-22-2015 12:54 PM
I would switch the engine back to MapReduce to make your life easier, and then control the number of mappers spawned by controlling the input split sizes, where N is the minimum number of bytes you want a mapper to process and X is the maximum. This is easier than trying to understand how the Tez waves setting works, because that involves the current capacity of the queue and not just the size of the underlying data.
# start Pig with the MapReduce engine
pig -x mr
-- in your Pig script (Pig's SET syntax takes no '='; values are in bytes)
set mapreduce.input.fileinputformat.split.minsize 'N';
set mapreduce.input.fileinputformat.split.maxsize 'X';
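To make that concrete (illustrative numbers, assuming a 128 MB block size): roughly 17 GB of input splits into about 134 map tasks, which matches the concurrency you're seeing. Raising the minimum split size to 512 MB would group that same input into roughly 34 mappers:
set mapreduce.input.fileinputformat.split.minsize '536870912';  -- 512 MB
set mapreduce.input.fileinputformat.split.maxsize '1073741824'; -- 1 GB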
Created 12-22-2015 10:27 PM
I think this is probably the better approach. We initially went with Tez for its better performance over MapReduce, but since our cluster is easily overwhelming Elasticsearch, it seems reasonable to revert to MapReduce and tune settings that are easier to control.
Thank you.
