
Where and when does FileInputFormat run?

Contributor

If it runs in the AppMaster, what exactly are "the computed input splits" that the job client stores in HDFS while submitting the job?

"Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3)."

The line above is from Hadoop: The Definitive Guide.

And how does a map task work if its split spans data blocks on two different DataNodes?

1 ACCEPTED SOLUTION

"If it runs in the Appmaster, what exactly are "the computed input splits" that jobclient stores into HDFS while submitting the Job ??"

InputSplits are simply the work assignments of a mapper.

For example, say you have an input folder containing:

/in/file1 /in/file2

Assume file1 is 200 MB and file2 is 100 MB (default block size 128 MB).

By default the InputFormat will then generate 3 input splits (on the AppMaster; this is a function of the InputFormat):

InputSplit1: /in/file1:0:128000000
InputSplit2: /in/file1:128000001:200000000
InputSplit3: /in/file2:0:100000000

(By default one split = one block, but the InputFormat COULD do whatever it wants. It does this, for example, for small files, where it uses MultiFileInputSplits that span multiple files.)
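To make the example concrete, here is a minimal Python sketch of how a file-based InputFormat could carve files into splits. It is not Hadoop's code: `Split` and `compute_splits` are hypothetical names, the split size is simply assumed to equal the 128 MB block size, and MB is taken as 1024 * 1024 bytes, so the byte offsets differ from the rounded figures above. The 1.1 slop factor mirrors the tolerance FileInputFormat applies to the last split of a file.

```python
from dataclasses import dataclass

MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the tail split may be up to 10% larger than split_size


@dataclass(frozen=True)
class Split:
    path: str
    start: int   # byte offset where the split begins
    length: int  # number of bytes in the split


def compute_splits(files, split_size):
    """Return a list of Split objects for a {path: size_in_bytes} mapping."""
    splits = []
    for path, size in files.items():
        remaining = size
        # Carve off full-size splits while clearly more than one split remains.
        while remaining / split_size > SPLIT_SLOP:
            splits.append(Split(path, size - remaining, split_size))
            remaining -= split_size
        # Whatever is left becomes the last split of this file.
        if remaining > 0:
            splits.append(Split(path, size - remaining, remaining))
    return splits


# Assumed sizes from the example: file1 = 200 MB, file2 = 100 MB.
splits = compute_splits({"/in/file1": 200 * MB, "/in/file2": 100 * MB}, 128 * MB)
for s in splits:
    print(f"{s.path}: start={s.start} length={s.length}")
```

For the two files in the example this produces three splits, and the second split of file1 starts at exactly the byte where the first one ends.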

"And how map works if the split spans over data blocks in two different data nodes??"

The mapper comes up (normally local to the block) and starts reading the file at the offset provided. HDFS is by definition a global filesystem: if you read non-local parts of a file, the client reads them over the network, although local reads are obviously more efficient. It COULD read anything; the HDFS API makes this transparent. NORMALLY the input splits are generated so that this does not happen, so data can be read locally, but that is not a necessary precondition. Often maps are non-local (you can see that in the ResourceManager), and then the mapper simply reads the data over the network. The API call is identical. Reading an HDFS file in Java is the same as reading a local file; it is just an extension of the Java FileSystem API.
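As a rough illustration of a split that straddles blocks on different DataNodes, the Python sketch below counts how many bytes of a split have a replica on the mapper's own host and how many would have to come over the network. Everything in it is hypothetical: the `Block` class, the host names and the block layout are made up, and this is not an HDFS API. In a real job the HDFS client works this out internally and the mapper just reads a stream.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class Block:
    offset: int   # where the block starts within the file
    length: int   # number of bytes in the block
    hosts: tuple  # nodes holding a replica (hypothetical host names)


def local_and_remote_bytes(blocks, split_start, split_length, mapper_host):
    """Count the bytes of a split that are local vs. remote to mapper_host."""
    split_end = split_start + split_length
    local = remote = 0
    for block in blocks:
        # Overlap of [split_start, split_end) with this block's byte range.
        lo = max(split_start, block.offset)
        hi = min(split_end, block.offset + block.length)
        if hi <= lo:
            continue  # the split does not touch this block
        if mapper_host in block.hosts:
            local += hi - lo
        else:
            remote += hi - lo
    return local, remote


# Assumed layout: a 200 MB file stored as two blocks on disjoint sets of nodes.
blocks = [
    Block(0, 128 * MB, ("node1", "node2", "node3")),
    Block(128 * MB, 72 * MB, ("node4", "node5", "node6")),
]
# A made-up split covering bytes [100 MB, 160 MB), read by a mapper on node1.
local, remote = local_and_remote_bytes(blocks, 100 * MB, 60 * MB, "node1")
print(local // MB, "MB local,", remote // MB, "MB over the network")
```

Either way the mapper issues the same read calls; only the cost of the read differs.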


10 REPLIES

Expert Contributor

Once the client submits the request, YARN creates the AppMaster.

When the AppMaster is created, it takes the maximum available memory and cores, and a container is created.

1) During the map task, it reads the input split data for the job jar (TextInputFormat by default). If there is 1 GB of data with a 256 MB block size, 10 splits will be created.

2) Input splits are read by LineRecordReader, which reads data from an FSDataInputStream. It continues until it has completed all input splits for the map task.

3) Once the map task has completed with LineRecordReader, the RecordReader has finished reading and the reducer task runs on the output.

Contributor

So does it read all 1 GB of data, then divide the data into logical splits and assign a map task to each one?

Then what are the computed input splits placed in HDFS while the job is being submitted? At that point the AppMaster will not even have been launched.

And how can a 1 GB file be divided into 10 splits if the block size is 256 MB? The division is based on the split size, which is configurable (as far as I know).

Expert Contributor

1) The AppMaster launches one map task for each map split, and there is a map split for each input file. If an input file is too big (bigger than the block size), then two or more map splits are associated with the same input file.

2) The AppMaster is launched first and creates a map task for each input split.

3) Correction: the 1 GB figure was a typo. It has 4 splits with a 256 MB block size. For each map split it asks for 1 container in MR1, whereas MR2 with Tez uses 1 container for its job.

Expert Contributor

Please find more information below.

InputFormat describes the input-specification for a MapReduce job.

The MapReduce framework relies on the InputFormat of the job to:

  1. Validate the input-specification of the job.
  2. Split-up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
  3. Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.
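The bounds described above can be sketched in a few lines of Python. This is an illustration rather than Hadoop's code: `compute_split_size` and `approx_split_count` are hypothetical names, the rule shown is max(minSize, min(maxSize, blockSize)) as FileInputFormat applies it, and the count ignores the small tolerance allowed on the last split. The upper bound is assumed to come from mapreduce.input.fileinputformat.split.maxsize, the counterpart of the minsize property mentioned above.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size derived from the block size and the configured bounds."""
    return max(min_size, min(max_size, block_size))


def approx_split_count(file_size, split_size):
    """Rough number of splits for one file (tail-split tolerance ignored)."""
    return math.ceil(file_size / split_size)


one_gb = 1024 * MB
block = 256 * MB

default_size = compute_split_size(block)                     # bounds left at defaults
raised_min = compute_split_size(block, min_size=512 * MB)    # minsize above block size
lowered_max = compute_split_size(block, max_size=64 * MB)    # maxsize below block size

counts = {
    "default": approx_split_count(one_gb, default_size),
    "minsize=512MB": approx_split_count(one_gb, raised_min),
    "maxsize=64MB": approx_split_count(one_gb, lowered_max),
}
print(counts)
```

With the defaults, a 1 GB file and a 256 MB block size give 4 splits. Under this rule a minimum larger than the block size produces splits bigger than a block, so the block size acts as an upper bound only while the minimum stays below it.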

Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task.
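One way a line-oriented reader can respect record boundaries is sketched below in Python, using an in-memory byte stream as a stand-in for a file. The function name and the sample data are made up for illustration. The convention is the one a line reader such as LineRecordReader follows: a split that does not start at byte 0 skips its first (possibly partial) line, and every split keeps reading until it has consumed the line that crosses or starts at its end offset.

```python
import io


def read_lines_for_split(stream, start, length):
    """Yield the complete lines owned by the byte range [start, start + length)."""
    end = start + length
    stream.seek(start)
    pos = start
    if start != 0:
        # The first line here belongs to the previous split, which reads
        # past its own end to finish it. Skip it.
        pos += len(stream.readline())
    # Keep reading while the next line starts at or before the split end.
    while pos <= end:
        line = stream.readline()
        if not line:
            break  # end of file
        pos += len(line)
        yield line.rstrip(b"\n")


data = b"alpha\nbravo\ncharlie\ndelta\n"
# Arbitrary split boundaries that cut through the middle of lines.
boundaries = [(0, 10), (10, 10), (20, 6)]
per_split = [
    list(read_lines_for_split(io.BytesIO(data), start, length))
    for start, length in boundaries
]
for (start, length), lines in zip(boundaries, per_split):
    print(start, length, lines)
```

Every line is read by exactly one split even though the byte boundaries fall mid-line, and a split can end up with no records at all if the previous split already consumed its only line.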

TextInputFormat is the default InputFormat.

  • The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager, which then assumes the responsibility of distributing the software/configuration to the slaves (HDFS or DataNodes), scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

If this is helpful, comments and an accept are appreciated.

Contributor

@Shiv kumar

That is what I am saying. So "where does this happen?" is my question.

Expert Contributor

Yes, this happens on the slave nodes (DataNodes or HDFS nodes only).

Contributor

I don't feel good saying this, but I am not satisfied with your answer. It is fine that the application master does the job of calling the InputFormat, calculating the input splits, and so on. But I am asking what the sentence quoted from the Definitive Guide means when it says that the client places the computed input splits in HDFS.

I am sorry if I am unable to explain my doubt properly.

Expert Contributor

Thank you for your opinion. The information on input splits in my earlier reply above may be helpful.