<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Execute MapReduce job only on a part of a HDFS file in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175401#M137660</link>
    <description>&lt;P&gt;Try something like:&lt;/P&gt;&lt;PRE&gt;hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000&lt;/PRE&gt;</description>
    <pubDate>Tue, 28 Mar 2017 04:20:23 GMT</pubDate>
    <dc:creator>umair_khan</dc:creator>
    <dc:date>2017-03-28T04:20:23Z</dc:date>
    <item>
      <title>Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175400#M137659</link>
      <description>&lt;P&gt;Hello everybody, &lt;/P&gt;&lt;P&gt;I have a big file in HDFS (~20 GB) on which I usually execute a MapReduce job. Around 170 mappers are created. The InputFormat used is a FileInputFormat.&lt;/P&gt;&lt;P&gt;Now I would like to execute the MapReduce job only on a part of the file (for example, the first 40 MB).&lt;/P&gt;&lt;P&gt;Is there a simple way to do this? &lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Mon, 27 Mar 2017 21:08:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175400#M137659</guid>
      <dc:creator>christophe_daco</dc:creator>
      <dc:date>2017-03-27T21:08:53Z</dc:date>
    </item>
    <item>
      <title>Re: Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175401#M137660</link>
      <description>&lt;P&gt;Try something like:&lt;/P&gt;&lt;PRE&gt;hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000&lt;/PRE&gt;</description>
      <pubDate>Tue, 28 Mar 2017 04:20:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175401#M137660</guid>
      <dc:creator>umair_khan</dc:creator>
      <dc:date>2017-03-28T04:20:23Z</dc:date>
    </item>
    <item>
      <title>Re: Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175402#M137661</link>
      <description>&lt;P&gt;I would recommend splitting up the file and then running your MR job on each of the resulting files.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Mar 2017 07:14:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175402#M137661</guid>
      <dc:creator>ssingla</dc:creator>
      <dc:date>2017-03-28T07:14:13Z</dc:date>
    </item>
    <item>
      <title>Re: Execute MapReduce job only on a part of a HDFS file</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175403#M137662</link>
      <description>&lt;P&gt;Hello all, &lt;/P&gt;&lt;P&gt;Thanks @&lt;A href="https://community.hortonworks.com/users/309/ssingla.html"&gt;ssingla&lt;/A&gt; and @&lt;A href="https://community.hortonworks.com/users/2827/umairkhan.html"&gt;Umair Khan&lt;/A&gt; for your answers. &lt;/P&gt;&lt;P&gt;Finally, I found a solution: subclass `FileInputFormat` and override the `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file. &lt;/P&gt;&lt;P&gt;In this method, I first call the superclass to get the splits generated by `FileInputFormat`. From the job configuration, I retrieve the offsets of the beginning and end of the part of the HDFS file I want to read. Finally, the start and end offsets of each split returned by the superclass's `getSplits` method are compared against these offsets, and only the splits that fall within the wanted part of the file are returned.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Mar 2017 14:46:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Execute-MapReduce-job-only-on-a-part-of-a-HDFS-file/m-p/175403#M137662</guid>
      <dc:creator>christophe_daco</dc:creator>
      <dc:date>2017-03-29T14:46:05Z</dc:date>
    </item>
  </channel>
</rss>

