<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Execute MapReduce job only on a part of a HDFS file in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175401#M137660</link>
    <description>&lt;P&gt;Try something like:&lt;/P&gt;&lt;PRE&gt;hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000&lt;/PRE&gt;</description>
    <pubDate>Tue, 28 Mar 2017 04:20:23 GMT</pubDate>
    <dc:creator>umair_khan</dc:creator>
    <dc:date>2017-03-28T04:20:23Z</dc:date>
    <item>
      <title>Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175400#M137659</link>
      <description>&lt;P&gt;Hello everybody, &lt;/P&gt;&lt;P&gt;I have a big file in HDFS (~20 GB) on which I usually execute a MapReduce job. Around 170 mappers are created. The InputFormat used is a FileInputFormat.&lt;/P&gt;&lt;P&gt;Now I would like to execute the MapReduce job only on a part of the file (for example, the first 40 MB).&lt;/P&gt;&lt;P&gt;Is there a simple way to do this? &lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Mon, 27 Mar 2017 21:08:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175400#M137659</guid>
      <dc:creator>christophe_daco</dc:creator>
      <dc:date>2017-03-27T21:08:53Z</dc:date>
    </item>
    <item>
      <title>Re: Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175401#M137660</link>
      <description>&lt;P&gt;Try something like:&lt;/P&gt;&lt;PRE&gt;hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000&lt;/PRE&gt;</description>
      <pubDate>Tue, 28 Mar 2017 04:20:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175401#M137660</guid>
      <dc:creator>umair_khan</dc:creator>
      <dc:date>2017-03-28T04:20:23Z</dc:date>
    </item>
    <item>
      <title>Re: Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175402#M137661</link>
      <description>&lt;P&gt;I would recommend splitting up the file and then running your MR job on each of the resulting files.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Mar 2017 07:14:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175402#M137661</guid>
      <dc:creator>ssingla</dc:creator>
      <dc:date>2017-03-28T07:14:13Z</dc:date>
    </item>
    <item>
      <title>Re: Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175403#M137662</link>
      <description>&lt;P&gt;Hello all, &lt;/P&gt;&lt;P&gt;Thanks @&lt;A href="https://community.hortonworks.com/users/309/ssingla.html"&gt;ssingla&lt;/A&gt; and @&lt;A href="https://community.hortonworks.com/users/2827/umairkhan.html"&gt;Umair Khan&lt;/A&gt; for your answers. &lt;/P&gt;&lt;P&gt;Finally, I found a solution: subclass `FileInputFormat` and override the `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file. &lt;/P&gt;&lt;P&gt;In this method, I first call the superclass to get the splits generated by `FileInputFormat`. From the job configuration, I retrieve the offsets of the beginning and end of the part of the HDFS file I want to read. Finally, the start and end offsets of each split returned by the superclass's `getSplits` method are compared against these offsets, and only the splits that fall within the wanted part of the file are returned.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Mar 2017 14:46:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175403#M137662</guid>
      <dc:creator>christophe_daco</dc:creator>
      <dc:date>2017-03-29T14:46:05Z</dc:date>
    </item>
  </channel>
</rss>

