Created on 07-26-2016 07:16 AM - edited 09-16-2022 03:31 AM
If it runs in the AppMaster, what exactly are "the computed input splits" that the JobClient stores in HDFS while submitting the job?
"Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3).".
The above line is from Hadoop: The Definitive Guide.
And how does a map task work if the split spans data blocks on two different DataNodes?
Created 08-12-2016 01:02 PM
"If it runs in the Appmaster, what exactly are "the computed input splits" that jobclient stores into HDFS while submitting the Job ??"
InputSplits are simply the work assignments of a mapper.
I.e., say you have the input folder
/in/file1 /in/file2
and assume file1 is 200 MB and file2 is 100 MB (default block size 128 MB).
So by default the InputFormat will generate 3 input splits (on the AppMaster this is a function of the InputFormat):
InputSplit1: /in/file1:0:128000000
InputSplit2: /in/file1:128000001:200000000
InputSplit3: /in/file2:0:100000000
(By default one split = one block, but the InputFormat COULD do whatever it wants. It does this for example for small files, where it uses MultiFileInputSplits that span multiple files.)
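To make that concrete, here is a minimal sketch in plain Java (not the actual FileInputFormat code) of how one split per block could be computed for the two hypothetical files above; the paths and sizes are just the example values from this post.

    import java.util.ArrayList;
    import java.util.List;

    public class SplitSketch {
        // Mirrors what a FileSplit records: file path, start offset, and length in bytes.
        static class Split {
            final String path; final long start; final long length;
            Split(String path, long start, long length) {
                this.path = path; this.start = start; this.length = length;
            }
            @Override public String toString() { return path + ":" + start + "+" + length; }
        }

        // One split per full or partial block, the default FileInputFormat-style behaviour.
        static List<Split> computeSplits(String path, long fileSize, long blockSize) {
            List<Split> splits = new ArrayList<>();
            for (long offset = 0; offset < fileSize; ) {
                long length = Math.min(blockSize, fileSize - offset);
                splits.add(new Split(path, offset, length));
                offset += length;
            }
            return splits;
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;
            List<Split> all = new ArrayList<>();
            all.addAll(computeSplits("/in/file1", 200L * 1024 * 1024, blockSize)); // 2 splits
            all.addAll(computeSplits("/in/file2", 100L * 1024 * 1024, blockSize)); // 1 split
            all.forEach(System.out::println); // 3 splits total, one per block
        }
    }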
"And how map works if the split spans over data blocks in two different data nodes??"
So the mapper comes up (normally local to the block) and starts reading the file at the offset provided. HDFS is by definition global, and if you read non-local parts of a file it will read them over the network; local reads are obviously more efficient, but the mapper COULD read anything, and the HDFS API makes it transparent. So NORMALLY the input split generation is done in a way that this does not happen, so the data can be read locally, but that is not a necessary precondition. Often map tasks are non-local (you can see that in the Resource Manager), and then they simply read the data over the network. The API call is identical: reading an HDFS file in Java is the same as reading a local file. It is just an extension of the Java FileSystem API.
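As a rough illustration of that transparency, here is a small sketch that opens a file through the Hadoop FileSystem API and seeks to a split offset; the path and offset are only the example values from above, and the code is the same whether the underlying blocks are local or remote.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadSplitOffset {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);        // HDFS if fs.defaultFS points at it
            Path file = new Path("/in/file1");           // example path from this thread
            long splitStart = 128L * 1024 * 1024;        // offset of the second split above
            try (FSDataInputStream in = fs.open(file)) {
                in.seek(splitStart);                     // jump to the split's start offset
                byte[] buffer = new byte[4096];
                int read = in.read(buffer);              // remote blocks are fetched transparently
                System.out.println("Read " + read + " bytes starting at offset " + splitStart);
            }
        }
    }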
Created 07-26-2016 07:40 AM
Once the client submits the request, YARN creates the AppMaster.
While the AppMaster is being created, it takes the available memory and cores, and a container is created for it.
1) During the map task, the input split data is read using the InputFormat shipped in the job JAR (by default TextInputFormat). If it is 1 GB of data with a 256 MB block size, 10 splits will be created.
2) Input splits are read by LineRecordReader, which reads the data from an FSDataInputStream; it continues until it has completed all the input splits for the map task.
3) Once the map task is complete and the RecordReader has finished reading, the reducer task will run on its output.
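For illustration, here is a minimal sketch of a mapper consuming what LineRecordReader hands it under the default TextInputFormat: the key is the byte offset of each line within the file and the value is the line itself. The class name and the word-count style output are just illustrative, not something from this thread.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // LineRecordReader has already turned the input split into (offset, line) records.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, one); // reducers aggregate these pairs afterwards
                }
            }
        }
    }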
Created 07-26-2016 11:19 AM
So is it like it will read all the 1 GB of data and then split it into logical splits and assign a map task to each?
Then what are the computed input splits placed in HDFS while the job is being submitted... at that point the AppMaster will not even be launched.
And how come a 1 GB file will be divided into 10 splits if the block size is 256 MB? The division is based on the split size, which is configurable (as far as I know).
Created 07-27-2016 02:30 AM
1) The AppMaster will launch one map task for each map split; there are map splits for each input file. If the input file is too big (bigger than the block size), then we have two or more map splits associated with the same input file.
2) The AppMaster will be launched first and will create a map task for each input split.
3) Correcting what was a typo above: for 1 GB it has 4 splits with a 256 MB block size. For each map split it asks for one container in MR1, whereas MR2 with Tez uses one container for its job.
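On the earlier point that the split size is configurable: below is a rough sketch of a driver that raises the minimum split size (the job name, input path, and the 512 MB value are only illustrative). With a 256 MB block size and a 1 GB input the default is one split per block, i.e. 4 splits; raising the minimum to 512 MB would cut that to 2 larger splits.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-example");
            job.setJarByClass(SplitSizeDriver.class);
            job.setInputFormatClass(TextInputFormat.class);      // default anyway, shown for clarity
            FileInputFormat.addInputPath(job, new Path("/in"));  // example input folder

            // Default: one split per 256 MB block -> 4 splits for a 1 GB file.
            // Raising the lower bound to 512 MB merges them into 2 larger splits.
            FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);

            // ... set mapper/reducer/output classes, then job.waitForCompletion(true)
        }
    }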
Created 08-01-2016 10:13 AM
Please find more information below:
InputFormat describes the input-specification for a MapReduce job.
The MapReduce framework relies on the InputFormat of the job to:
1) Validate the input specification of the job.
2) Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
3) Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.
Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
TextInputFormat is the default InputFormat.
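To show how a record reader can respect record boundaries that straddle split edges, here is a deliberately simplified sketch against a local file. It is not the real LineRecordReader code (which works on an FSDataInputStream), but it follows the same rule: a split that does not start at offset 0 skips its first partial line, and every split keeps reading until the line it started is finished. The file path and split size are made up for the example.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class BoundarySketch {
        // Print the lines "owned" by the byte range [start, start + length) of a file.
        static void readSplitLines(String path, long start, long length) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                raf.seek(start);
                if (start != 0) {
                    raf.readLine();               // partial first line belongs to the previous split
                }
                while (raf.getFilePointer() < start + length) {
                    String line = raf.readLine(); // may read past the split end to finish a line
                    if (line == null) {
                        break;                    // end of file
                    }
                    System.out.println(line);     // hand the record to the map function
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // Hypothetical local file and a 128 MB "split" purely for demonstration.
            readSplitLines("/tmp/file1", 0, 128L * 1024 * 1024);
        }
    }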
If this is helpful, comments and accepting the answer are appreciated.
Created 08-01-2016 10:15 AM
Please find more information on the Apache Hadoop site.
Created 08-01-2016 03:30 PM
That is what I am saying. So "where does this happen?" is my question.
Created 08-03-2016 06:15 AM
Yes, this happens on the slave nodes (DataNodes / HDFS nodes only).
Created 08-03-2016 12:33 PM
I do not feel good saying this, but I am not satisfied with your answer. It is fine that the ApplicationMaster does the job of calling the InputFormat, calculating the input splits, and so on. But I am asking what is the meaning of the sentence quoted from the Definitive Guide, that the client places the computed input splits in HDFS.
I am sorry if I am unable to explain my doubt properly.
Created 08-03-2016 05:45 PM
Thank you for your opinion. The information below should be helpful regarding input splits.
InputFormat describes the input-specification for a MapReduce job.
The MapReduce framework relies on the InputFormat of the job to:
1) Validate the input specification of the job.
2) Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
3) Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.
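As a small aside, the lower bound mentioned above can also be set programmatically rather than on the command line; a minimal sketch, with a 256 MB value chosen purely as an example:

    import org.apache.hadoop.conf.Configuration;

    public class MinSplitSizeExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Same effect as -Dmapreduce.input.fileinputformat.split.minsize=268435456
            conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);
            System.out.println(conf.getLong("mapreduce.input.fileinputformat.split.minsize", 0));
        }
    }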
Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
TextInputFormat is the default InputFormat.
If this is helpful, comments and accepting the answer are appreciated.