Support Questions


Where and when does FileInputFormat run?

Expert Contributor

If it runs in the ApplicationMaster, what exactly are "the computed input splits" that the JobClient stores in HDFS while submitting the job?

"Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3).".

The above line is from Hadoop: The Definitive Guide.

And how does the map work if the split spans data blocks on two different DataNodes?

1 ACCEPTED SOLUTION

Master Guru

"If it runs in the Appmaster, what exactly are "the computed input splits" that jobclient stores into HDFS while submitting the Job ??"

InputSplits are simply the work assignments of a mapper.

Say, for example, you have the input folder:

/in/file1
/in/file2

And assume file1 is 200 MB and file2 is 100 MB (default block size: 128 MB).

So by default the InputFormat will generate three input splits. (Split computation is a function of the InputFormat: the job client calls its getSplits() method at submission time, and the serialized splits are exactly the "computed input splits" that get stored in HDFS for the ApplicationMaster to read.)

InputSplit1: /in/file1, offset 0, length 128000000
InputSplit2: /in/file1, offset 128000000, length 72000000
InputSplit3: /in/file2, offset 0, length 100000000

(By default one split = one block, but an InputFormat COULD do whatever it wants. It does this, for example, for lots of small files, where MultiFileInputFormat/CombineFileInputFormat produce splits that span multiple files.)
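To make this concrete, here is a minimal sketch of the same getSplits() call the job client makes at submission time. The /in path is hypothetical, and this is not the actual submission code, just the same API:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitDump {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/in")); // hypothetical input folder

        // getSplits() computes the (path, offset, length) triples; the job
        // client serializes this list into HDFS when it submits the job.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split); // a FileSplit prints as path:start+length
        }
    }
}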

"And how map works if the split spans over data blocks in two different data nodes??"

So the mapper comes up (normally local to the block) and starts reading the file at the offset provided. HDFS is by definition a global filesystem: if you read non-local parts of a file, the data is read over the network; a local read is simply more efficient. But a mapper COULD read anything, because the HDFS API makes the location transparent.

So NORMALLY the InputSplit generation is done in a way that remote reads do not happen and the data can be read locally, but that is not a necessary precondition. Maps often run non-locally (you can see that in the ResourceManager), and then the mapper simply reads the data over the network. The API call is identical either way: reading an HDFS file in Java looks the same as reading a local file, since HDFS is just one implementation of the Hadoop FileSystem API.
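For illustration, a minimal sketch of reading an HDFS file at a given offset, the way a mapper's record reader does. The path and offset are hypothetical; the point is that the same call works whether the block happens to be local or remote:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadAtOffset {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file and offset: seek to the start of the second split.
        try (FSDataInputStream in = fs.open(new Path("/in/file1"))) {
            in.seek(128000000L);      // same call whether the block is local or not
            byte[] buf = new byte[4096];
            int n = in.read(buf);     // HDFS streams remote blocks over the network
            System.out.println("read " + n + " bytes");
        }
    }
}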

