We can set the number of reduce tasks for the MapReduce jobs generated by Pig with "set default_parallel" or the PARALLEL clause, but how do we set the number of map tasks?

Solved


Rising Star
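For context, the two reducer-side mechanisms the question refers to look like this in a Pig script (the relation names and parallelism values below are purely illustrative):

```pig
-- Default number of reducers for every MapReduce job this script generates
SET default_parallel 20;

users = LOAD 'users' USING PigStorage(',') AS (id:int, country:chararray);

-- The PARALLEL clause overrides the default for a single operator
grouped = GROUP users BY country PARALLEL 10;
```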
 
1 ACCEPTED SOLUTION

Re: We can set the number of reduce tasks for the MapReduce jobs generated by Pig with "set default_parallel" or the PARALLEL clause, but how do we set the number of map tasks?

There is actually a way to change the number of mappers in Pig. Pig uses a CombineFileInputFormat to merge small files into bigger map tasks. This behavior is enabled by default and can be tuned with the following parameters; for everything else, what Artem said applies.
  • pig.maxCombinedSplitSize – Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached.
  • pig.splitCombination – Turns split combining on or off (set to "true" by default).
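As a minimal sketch, these two properties might be set at the top of a Pig script like this (the 256 MB value is only an illustrative choice):

```pig
-- Enable combining of small input files (this is the default anyway)
SET pig.splitCombination true;
-- Combine small files into splits of up to 256 MB (268435456 bytes) each,
-- which caps how much data a single combined map task processes
SET pig.maxCombinedSplitSize 268435456;
```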
4 REPLIES

Re: We can set the number of reduce tasks for the MapReduce jobs generated by Pig with "set default_parallel" or the PARALLEL clause, but how do we set the number of map tasks?

Mentor

You can't directly set the number of mappers; it is determined by the number of input splits, which by default correspond to the blocks in your dataset.

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduc...
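Beyond Pig's own combine-split settings, you can also hint at the split size, and hence the map count, through the standard MapReduce input-split properties. This is a sketch assuming Hadoop 2 property names; like NUM_MAPS, it is a hint to the framework rather than a hard setting:

```pig
-- Larger max split size => fewer, larger map tasks; smaller => more maps.
-- The values below (128 MB min, 256 MB max) are illustrative only.
SET mapreduce.input.fileinputformat.split.minsize 134217728;
SET mapreduce.input.fileinputformat.split.maxsize 268435456;
```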

Re: We can set the number of reduce tasks for the MapReduce jobs generated by Pig with "set default_parallel" or the PARALLEL clause, but how do we set the number of map tasks?

Rising Star

Thanks, @Artem!

