Does the TaskTracker spawn a new Mapper for each input split or for each key-value pair?
Labels: Apache Hadoop, Apache YARN
Created 12-28-2015 07:58 PM
As per Hadoop: The Definitive Guide:
- Mapper: the map task spawned by the TaskTracker in a separate JVM to process an entire input split. For TextInputFormat, this would be a specific number of lines from your input file.
- map method: the method called for every record (key-value pair) in the split, i.e. Mapper.map(...). With TextInputFormat, each invocation of the map method processes one line of your input split.
Given the above, the TaskTracker spawns a new Mapper for each input split.
But if you look at the Mapper class declaration:
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
This seems to mean that the Mapper class/object takes one key/value pair at a time: once that k/v pair has been processed, the object is done, and the next k/v pair is processed by another, new Mapper object.
For example, think of a 64 MB block that contains 1000 records (key-value pairs). Does the framework create 1000 mappers here, or just a single mapper?
This is a little confusing. Can anyone shed more light on what exactly happens in this case?
Thanks in advance.
Created 12-28-2015 09:10 PM
@Gangadhar Kadam One map task is initiated for each input split (by default, a split usually corresponds to one file block). The number of map tasks does not depend on the number of records (K, V pairs) in that block or input split. So, if you have m blocks or input splits, at least m map tasks will be initiated. It can be more than m if speculative execution is turned on, since duplicate attempts of slow tasks may be launched.
With respect to your example, if your 64 MB file has 1000 records and occupies one block, then only one map task would be triggered. That single task calls map() once per record, so 1000 times.
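To make the distinction concrete, here is a minimal plain-Java sketch of the lifecycle. It does not use the Hadoop API: `MapperLifecycleSketch`, `SketchMapper` and `simulate` are hypothetical names made up for this illustration. The `run` method only mirrors the general shape of Hadoop's `Mapper.run()` (setup once, `map()` once per record, cleanup once); the real framework reads records through a `Context`, which is left out here.

```java
// Plain-Java simulation (no Hadoop dependency) of the per-split mapper lifecycle.
// All names here are hypothetical and exist only for illustration.
class MapperLifecycleSketch {

    /** Stand-in for a user-defined mapper; one instance handles one input split. */
    static class SketchMapper {
        int mapCalls = 0;

        void setup() { /* called once per split, before any record */ }

        // Called once per record. The key here is just the record index;
        // with TextInputFormat the real key is the byte offset of the line.
        void map(long key, String value) {
            mapCalls++;
        }

        void cleanup() { /* called once per split, after the last record */ }

        // Same shape as Mapper.run(): setup, loop over the records, cleanup.
        void run(String[] splitRecords) {
            setup();
            try {
                for (int i = 0; i < splitRecords.length; i++) {
                    map(i, splitRecords[i]);
                }
            } finally {
                cleanup();
            }
        }
    }

    /** Returns {mapper instances created, total map() invocations}. */
    static int[] simulate(int numSplits, int recordsPerSplit) {
        int instances = 0;
        int totalMapCalls = 0;
        for (int s = 0; s < numSplits; s++) {
            String[] records = new String[recordsPerSplit];
            for (int r = 0; r < recordsPerSplit; r++) {
                records[r] = "line " + r;
            }
            SketchMapper mapper = new SketchMapper(); // one object per split
            instances++;
            mapper.run(records);                      // same object sees every record
            totalMapCalls += mapper.mapCalls;
        }
        return new int[] { instances, totalMapCalls };
    }

    public static void main(String[] args) {
        int[] result = simulate(1, 1000); // one 64 MB block holding 1000 records
        System.out.println("mapper instances = " + result[0]
                + ", map() calls = " + result[1]);
        // prints: mapper instances = 1, map() calls = 1000
    }
}
```

So the type parameters in `Mapper<LongWritable, Text, Text, IntWritable>` describe what a single `map()` call receives and emits, not how long the Mapper object lives.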
Created 12-29-2015 01:18 AM
Thanks Pradeep!
Created 12-30-2015 06:27 PM
@Gangadhar Kadam As a best practice, please accept the answer if you are satisfied with it. Then we can close this question.