Created 12-15-2015 09:35 PM
I'm using the Elasticsearch Hadoop connector to push data via Pig to Elasticsearch. This process worked very well on my 6-node cluster. I now have a 12-node cluster running HDP 2.3, and it seems that Pig is now pushing data faster than Elasticsearch can keep up with.
My cluster is running 134 tasks at once for this Pig job. Is there an easy way to change the number of simultaneous tasks for a Pig/Tez job? I changed my queue configuration to limit one of the queues to 50% of resources, and I'm still overloading Elasticsearch.
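For reference, the queue change was along these lines (a sketch; "espush" is a hypothetical queue name). One thing I may be missing: unless maximum-capacity is also lowered, the Capacity Scheduler lets a queue elastically borrow idle resources beyond its configured capacity.
# capacity-scheduler.xml style properties; queue name is hypothetical
yarn.scheduler.capacity.root.espush.capacity=50
yarn.scheduler.capacity.root.espush.maximum-capacity=50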
Created 12-15-2015 11:02 PM
Have you tried setting mapreduce.jobtracker.maxtasks.perjob for your Pig application?
Alternatively, you can use node labels to run your Pig job on specific nodes, as in the sketch below.
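A minimal sketch of the node-label route, assuming node labels are enabled in YARN; the label "es_safe" and queue "es_queue" are hypothetical names:
# add a label to the cluster and attach it to specific nodes
yarn rmadmin -addToClusterNodeLabels "es_safe"
yarn rmadmin -replaceLabelsOnNode "node1=es_safe node2=es_safe"
-- then, in the Pig script, submit to a queue mapped to that label
-- (e.g. via yarn.scheduler.capacity.root.es_queue.default-node-label-expression)
set mapreduce.job.queuename 'es_queue';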
Created 12-21-2015 03:48 PM
In my search for ways to reduce parallel tasks, I had not yet come across that. I will give it a try. Thank you!
Created 12-22-2015 12:54 PM
I would switch the engine back to MapReduce to make your life easier, and then control the number of Mappers spawned by controlling the input split values, where N is the minimum number of bytes you want a single Mapper to process and X is the maximum. This is easier than trying to understand how the Tez waves setting works, because that involves the current capacity of a queue and not just the size of the underlying data.
# start Pig with the MapReduce engine
pig -x mr
-- in your Pig script, set the split sizes (N and X in bytes)
set mapreduce.input.fileinputformat.split.minsize N;
set mapreduce.input.fileinputformat.split.maxsize X;
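For example, with hypothetical numbers: to cap a job reading roughly 100 GB at about 40 concurrent Mappers, each split needs to be 100 GB / 40 = 2.5 GB, i.e. 2684354560 bytes:
-- hypothetical sizing: ~100 GB of input, target of ~40 Mappers
set mapreduce.input.fileinputformat.split.minsize 2684354560;
set mapreduce.input.fileinputformat.split.maxsize 2684354560;
Pinning minsize and maxsize to the same value fixes the split size, since FileInputFormat computes it as max(minsize, min(maxsize, blocksize)).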
Created 12-22-2015 10:27 PM
I think this is probably the better approach. We were initially using Tez for its better performance over M/R. However, with our cluster so easily overwhelming Elasticsearch, it seems reasonable to revert to M/R and tweak settings that are easier to control.
Thank you.