<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: where and when does the FileInputFormat run? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148141#M110670</link>
    <description>&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Thank you for your opinion; the information below may be helpful on input splits.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html"&gt;InputFormat&lt;/A&gt; describes the input-specification for a MapReduce job.&lt;/P&gt;&lt;P&gt;The MapReduce framework relies on the InputFormat of the job to:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Validate the input-specification of the job.&lt;/LI&gt;&lt;LI&gt;Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.&lt;/LI&gt;&lt;LI&gt;Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The default behavior of file-based InputFormat implementations, typically sub-classes of &lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html"&gt;FileInputFormat&lt;/A&gt;, is to split the input into &lt;EM&gt;logical&lt;/EM&gt; InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.&lt;/P&gt;&lt;P&gt;Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html"&gt;TextInputFormat&lt;/A&gt; is the default InputFormat.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;EM&gt;The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager, which then assumes the responsibility of distributing the software/configuration to the &lt;STRONG&gt;slaves (DataNodes)&lt;/STRONG&gt;, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.&lt;/EM&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If this is helpful, comments and accepting the answer are appreciated.&lt;/P&gt;</description>
    <pubDate>Thu, 04 Aug 2016 00:45:03 GMT</pubDate>
    <dc:creator>shivkumar82015</dc:creator>
    <dc:date>2016-08-04T00:45:03Z</dc:date>
    <item>
      <title>where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148132#M110661</link>
      <description>&lt;P&gt;If it runs in the AppMaster, what exactly are "&lt;EM&gt;the computed input splits&lt;/EM&gt;" that the job client stores into HDFS while submitting the job??&lt;/P&gt;&lt;P&gt;"&lt;EM&gt;Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3).&lt;/EM&gt;"&lt;/P&gt;&lt;P&gt;The above is a line from the Hadoop Definitive Guide.&lt;/P&gt;&lt;P&gt;And how does map work if the split spans data blocks on two different DataNodes??&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:31:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148132#M110661</guid>
      <dc:creator>shivanageshch</dc:creator>
      <dc:date>2022-09-16T10:31:22Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148133#M110662</link>
      <description>&lt;P&gt;Once the client submits the request, YARN creates the ApplicationMaster.&lt;/P&gt;&lt;P&gt;While creating the AppMaster it occupies the maximum available memory and cores, and a container is created.&lt;/P&gt;&lt;P&gt;1) During the map task, it reads the input-split data via the jar (by default TextInputFormat); if it is 1 GB of data with a 256 MB block size, 10 splits will be created.&lt;/P&gt;&lt;P&gt;2) Input splits are read by LineRecordReader, which reads data from FSDataInputStream until it completes all input splits for the map task.&lt;/P&gt;&lt;P&gt;3) Once the map task completes with LineRecordReader, the record reading is finished and the reduce task runs on the output.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 14:40:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148133#M110662</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-07-26T14:40:41Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148134#M110663</link>
      <description>&lt;P&gt;So is it that it will read all 1 GB of data, then split it into logical splits and assign a map task to each??&lt;/P&gt;&lt;P&gt;Then what are the computed input splits placed in HDFS while the job is being submitted? At that point the AppMaster will not even be launched.&lt;/P&gt;&lt;P&gt;And how would a 1 GB file be divided into 10 splits if the block size is 256 MB?? The division is based on the split size, which is configurable (as far as I know).&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 18:19:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148134#M110663</guid>
      <dc:creator>shivanageshch</dc:creator>
      <dc:date>2016-07-26T18:19:28Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148135#M110664</link>
      <description>&lt;P&gt;1) The AppMaster will launch one map task for each map split; there is a map split for each input file. If the input file is too big (bigger than the block size) then we have two or more map splits associated with the same input file.&lt;/P&gt;&lt;P&gt;2) The AppMaster is launched first and creates a map task for each input split.&lt;/P&gt;&lt;P&gt;3) Correction, that was a typo: 1 GB gives 4 splits with a 256 MB block size. Each map split asks for one container in MR1, whereas MR2 with Tez uses one container for its job.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 09:30:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148135#M110664</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-07-27T09:30:22Z</dc:date>
    </item>
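The corrected arithmetic above (1 GB with a 256 MB block size gives 4 splits) follows FileInputFormat's default sizing, splitSize = max(minSize, min(maxSize, blockSize)). A quick sketch of that calculation; the helper names here are ours, not Hadoop's, whose real logic lives in FileInputFormat.getSplits():

```python
# Sketch of FileInputFormat's default split sizing (simplified; helper names
# are illustrative, the real logic is in FileInputFormat.getSplits in Java).
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # min_size / max_size correspond to the configuration keys
    # mapreduce.input.fileinputformat.split.minsize / .maxsize.
    return max(min_size, min(max_size, block_size))

def count_splits(file_length, split_size):
    # One split per full chunk, plus one for any remainder (ceiling division).
    return (file_length + split_size - 1) // split_size

GB, MB = 1024 ** 3, 1024 ** 2
split_size = compute_split_size(block_size=256 * MB)
print(count_splits(1 * GB, split_size))  # → 4, the corrected figure above
```

With no min/max configured, the split size simply equals the block size, which is why the split count usually matches the block count.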
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148136#M110665</link>
      <description>&lt;P&gt;Please find more information below.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html"&gt;InputFormat&lt;/A&gt; describes the input-specification for a MapReduce job.&lt;/P&gt;&lt;P&gt;The MapReduce framework relies on the InputFormat of the job to:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Validate the input-specification of the job.&lt;/LI&gt;&lt;LI&gt;Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.&lt;/LI&gt;&lt;LI&gt;Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The default behavior of file-based InputFormat implementations, typically sub-classes of &lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html"&gt;FileInputFormat&lt;/A&gt;, is to split the input into &lt;EM&gt;logical&lt;/EM&gt; InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.&lt;/P&gt;&lt;P&gt;Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html"&gt;TextInputFormat&lt;/A&gt; is the default InputFormat.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;I&gt;The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager, which then assumes the responsibility of distributing the software/configuration to the slaves (DataNodes), scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.&lt;/I&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If this is helpful, comments and accepting the answer are appreciated.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2016 17:13:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148136#M110665</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-08-01T17:13:50Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148137#M110666</link>
      <description>&lt;P&gt;Please find more information on Apache hadoop org&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Job_Input" target="_blank"&gt;https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Job_Input&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2016 17:15:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148137#M110666</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-08-01T17:15:44Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148138#M110667</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11907/shivkumar82015.html" nodeid="11907"&gt;@Shiv kumar&lt;/A&gt;&lt;/P&gt;&lt;P&gt;That is what am saying. So " where this happens? " is my question. &lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2016 22:30:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148138#M110667</guid>
      <dc:creator>shivanageshch</dc:creator>
      <dc:date>2016-08-01T22:30:47Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148139#M110668</link>
      <description>&lt;P&gt;Yes, this happens on the slave nodes (DataNodes / HDFS nodes) only.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 13:15:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148139#M110668</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-08-03T13:15:33Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148140#M110669</link>
      <description>&lt;P&gt;I don't feel good saying this, but I am not satisfied with your answer. It is fine that the ApplicationMaster does the job of calling the InputFormat, calculating the input splits, and so on. But I am asking what is meant by the sentence quoted in the Definitive Guide, that the client places the computed input splits in HDFS.&lt;/P&gt;&lt;P&gt;I am sorry if I am unable to explain my doubt properly.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 19:33:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148140#M110669</guid>
      <dc:creator>shivanageshch</dc:creator>
      <dc:date>2016-08-03T19:33:06Z</dc:date>
    </item>
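For the Definitive Guide sentence in question: at submission time the client itself runs InputFormat.getSplits() and writes only the resulting split metadata (in Hadoop, the job.split and job.splitmetainfo files) to the job's HDFS staging directory; the ApplicationMaster later reads that metadata to launch one map task per split. A toy sketch of that client-side flow, using JSON purely for illustration rather than Hadoop's binary format:

```python
import json
import os
import tempfile

# Toy model of job submission (a sketch, NOT Hadoop's wire format): the client
# computes the splits itself, then persists only their metadata to the shared
# filesystem so the ApplicationMaster can later schedule one map task per split.
def compute_splits(files, split_size):
    """files: list of (path, length_in_bytes); returns split-metadata dicts."""
    splits = []
    for path, length in files:
        start = 0
        while start < length:
            chunk = min(split_size, length - start)
            splits.append({"path": path, "start": start, "length": chunk})
            start += chunk
    return splits

def submit_job(staging_dir, files, split_size):
    splits = compute_splits(files, split_size)
    # Only (path, offset, length) triples are stored, never the file data;
    # in real Hadoop this is the binary job.split / job.splitmetainfo pair.
    with open(os.path.join(staging_dir, "job.split"), "w") as f:
        json.dump(splits, f)
    return len(splits)  # the AM launches this many map tasks

with tempfile.TemporaryDirectory() as staging:
    n = submit_job(staging, [("/in/data", 1024 ** 3)], 256 * 1024 ** 2)
    print(n)  # → 4 map tasks for a 1 GB file with 256 MB splits
```

The point the sketch makes is that the "computed input splits" stored in HDFS are just offsets and lengths, which is why the client can compute them cheaply before any AppMaster exists.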
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148141#M110670</link>
      <description>&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Thank you for your opinion; the information below may be helpful on input splits.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html"&gt;InputFormat&lt;/A&gt; describes the input-specification for a MapReduce job.&lt;/P&gt;&lt;P&gt;The MapReduce framework relies on the InputFormat of the job to:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Validate the input-specification of the job.&lt;/LI&gt;&lt;LI&gt;Split up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.&lt;/LI&gt;&lt;LI&gt;Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The default behavior of file-based InputFormat implementations, typically sub-classes of &lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html"&gt;FileInputFormat&lt;/A&gt;, is to split the input into &lt;EM&gt;logical&lt;/EM&gt; InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.&lt;/P&gt;&lt;P&gt;Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html"&gt;TextInputFormat&lt;/A&gt; is the default InputFormat.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;EM&gt;The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager, which then assumes the responsibility of distributing the software/configuration to the &lt;STRONG&gt;slaves (DataNodes)&lt;/STRONG&gt;, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.&lt;/EM&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If this is helpful, comments and accepting the answer are appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2016 00:45:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148141#M110670</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-08-04T00:45:03Z</dc:date>
    </item>
    <item>
      <title>Re: where and when does the FileInputFormat run?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148142#M110671</link>
      <description>&lt;P&gt;"If it runs in the Appmaster, what exactly are "the computed input splits" that jobclient stores into HDFS while submitting the Job ??"&lt;/P&gt;&lt;P&gt;InputSplits are simply the work assignments of a mapper. &lt;/P&gt;&lt;P&gt;I.e. you have the inputfolder&lt;/P&gt;&lt;P&gt;/in/file1
/in/file2&lt;/P&gt;&lt;P&gt;And assume file1 has 200MB and file2 100MB ( default block size 128MB )&lt;/P&gt;&lt;P&gt;So the InputFormat per default will generate 3 input splits ( on the appmaster its a function of InputFormat)&lt;/P&gt;&lt;P&gt;InputSplit1: /in/file1:0:128000000
InputSplit2: /in/file1:128000001:200000000
InputSplit3:/in/file2:0:100000000&lt;/P&gt;&lt;P&gt;( per default one split = 1 block but he COULD do whatever he wants. He does this for example for small files where he uses MultiFileInputSplits which span multiple files )&lt;/P&gt;&lt;P&gt;"And how map works if the split spans over data blocks in two different data nodes??"&lt;/P&gt;&lt;P&gt;So the mapper comes up ( normally locally to the block ) and starts reading the file with the offset provided. HDFS by definition is global and if you read non local parts of a file he will read it over the network but local is obviously more efficient. But he COULD read anything. The HDFS API makes it transparent. So NORMALLY the InputSplit generation will be done in a way that this does not happen. So data can be read locally but its not a necessary precondition. Often maps are non local ( you can see that in the resource manager ) and then he can simply read the data over the network. The API call is identical. Reading an HDFS file in Java is the same as reading a local file. Its just an extension to the Java FileSystem API.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Aug 2016 20:02:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/where-an-when-does-the-fileinputformat-runs/m-p/148142#M110671</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-08-12T20:02:39Z</dc:date>
    </item>
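The file1/file2 example in the last post can be reproduced with simplified per-file split arithmetic. This is a sketch: the real FileInputFormat also merges a small final remainder into the previous split when it falls under roughly 10% of the split size, which we ignore here.

```python
# Reproducing the file1/file2 example: 200 MB + 100 MB input, 128 MB splits.
def generate_splits(files, split_size):
    """Yield (path, start, length) tuples, one split per chunk of each file."""
    for path, length in files:
        start = 0
        while start < length:
            chunk = min(split_size, length - start)
            yield (path, start, chunk)
            start += chunk

MB = 1_000_000  # the post uses decimal byte offsets (128000000 etc.)
splits = list(generate_splits([("/in/file1", 200 * MB), ("/in/file2", 100 * MB)],
                              split_size=128 * MB))
for path, start, length in splits:
    # Printed as path:start:end, treating each split as half-open [start, end).
    print(f"{path}:{start}:{start + length}")
# → /in/file1:0:128000000
#   /in/file1:128000000:200000000
#   /in/file2:0:100000000
```

This yields the same three splits as the post's InputSplit1 through InputSplit3; only the metadata (path, offset, length) is produced, never the file contents.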
  </channel>
</rss>