Support Questions
Find answers, ask questions, and share your expertise

What are the common jobs where number of reducers will be more than number mappers?


1 ACCEPTED SOLUTION


Re: What are the common jobs where number of reducers will be more than number mappers?

Normally you have more mappers than reducers, for two reasons:

a) In most analytical tasks you can filter out a huge percentage of the data at the source.

b) If you can choose where to compute things, it's better to do it in the mapper.

Therefore you would want more reducers for any job where heavy processing happens after a group-by or join and you cannot filter out data in the mapper.
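To make the map/shuffle/reduce flow concrete, here is a toy, pure-Python sketch (not actual Hadoop code; all names are illustrative). The mapper is a cheap pass-through, so the full data volume reaches the shuffle, and the expensive work sits in the reduce phase — the situation where reducer parallelism matters most:

```python
from collections import defaultdict

def map_phase(records):
    # Lightweight mapper: emit (key, value) pairs unchanged.
    # Nothing can be filtered out, so all data flows to the shuffle.
    for key, value in records:
        yield key, value

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # The heavy per-group work happens here; with many groups and an
    # expensive reduce_fn, you want many reducers running in parallel.
    return {key: reduce_fn(values) for key, values in groups.items()}

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
result = reduce_phase(shuffle(map_phase(records)), sum)
# result == {"a": 4, "b": 6}
```

In a real job, `reduce_fn` would be the costly aggregation or modelling step, and the reducer count would be tuned to the number of groups rather than to the input size.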

Things I could think of:

Running data mining inside MapReduce, for example to create one forecast model per product. In that case reading the data in the mapper is trivial, but the modelling step running in the reducer is heavy, so you would want more reducers than mappers.
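The per-product modelling pattern can be sketched in pure Python as follows (a simulation only; `train_forecast_model` is a hypothetical stand-in for a real, expensive fitting routine):

```python
from collections import defaultdict

def train_forecast_model(sales):
    # Stand-in for an expensive modelling step: forecast the next value
    # as the mean of the observed history.
    return sum(sales) / len(sales)

def forecast_reducer(records):
    # Reduce side: fit one model per product key. Reading the
    # (product, sale) pairs in the mapper is trivial; this fitting
    # step is what dominates the job's runtime.
    by_product = defaultdict(list)
    for product, sale in records:
        by_product[product].append(sale)
    return {product: train_forecast_model(s)
            for product, s in by_product.items()}

records = [("widget", 10), ("widget", 14), ("gadget", 3)]
models = forecast_reducer(records)
# models == {"widget": 12.0, "gadget": 3.0}
```

With thousands of products and a genuinely heavy model per key, each reducer handles a subset of the keys, so more reducers directly shorten the job.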

Inserting data into a (partitioned) ORC Hive table:

Creating ORC files is pretty heavy, and you want one reducer per partition (and potentially a couple of files for each), while reading a delimited input file is very lightweight. So here, too, you want more reducers than mappers.
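The routing that sends each partition's rows to a single reducer can be sketched like this (a pure-Python illustration of what a MapReduce Partitioner does; the keys and reducer count are made up):

```python
import zlib

def reducer_for_partition(partition_key, num_reducers):
    # Deterministically route every row of one Hive partition (e.g. one
    # date) to a single reducer, so that reducer writes the ORC file(s)
    # for that partition. crc32 is used instead of hash() so the mapping
    # is stable across processes, as a real partitioner must be.
    return zlib.crc32(partition_key.encode()) % num_reducers

rows = [("2024-01-01", "r1"), ("2024-01-02", "r2"), ("2024-01-01", "r3")]
num_reducers = 4  # ideally at least the number of distinct partitions
routed = {}
for part, row in rows:
    routed.setdefault(reducer_for_partition(part, num_reducers), []).append(row)
# All rows of a given partition land on the same reducer.
```

Because each reducer then sees all rows for its partitions, it can write each partition's ORC output in one pass instead of every mapper opening a writer per partition.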

...

