Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Execute MapReduce job only on a part of a HDFS file


Hello everybody,

I have a big file in HDFS (~20 GB) on which I usually run a MapReduce job; around 170 mappers are created. The InputFormat used is a FileInputFormat.

Now I would like to run the MapReduce job on only a part of the file (for example, the first 40 MB).

Is there a simple way to perform this?

Thanks for your help.

1 ACCEPTED SOLUTION


Hello all,

Thanks @ssingla and @Umair Khan for your answers.

Finally, I found a solution: subclass `FileInputFormat` and override the `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file.

In this method, I first call the superclass to get the splits that `FileInputFormat` would normally generate. From the job configuration, I read the start and end byte offsets of the part of the HDFS file I want to process. Finally, the start and end of each split returned by the superclass's `getSplits` method are compared against those offsets, and a split is kept only if it overlaps the wanted part of the file.
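For reference, a minimal sketch of this approach (the class name and the two configuration keys are my own invention, not standard Hadoop properties; the real driver would set the offsets on the job `Configuration`):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/**
 * InputFormat that keeps only the splits overlapping a byte range of the
 * input. The range is read from the job configuration; the property names
 * below are illustrative, not standard Hadoop keys.
 */
public class PartialFileInputFormat extends TextInputFormat {

    public static final String RANGE_START = "partial.input.start"; // inclusive byte offset
    public static final String RANGE_END   = "partial.input.end";   // exclusive byte offset

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        long start = conf.getLong(RANGE_START, 0L);
        long end   = conf.getLong(RANGE_END, Long.MAX_VALUE);

        // Let the superclass compute the normal splits, then keep only
        // those whose byte range [splitStart, splitEnd) overlaps [start, end).
        List<InputSplit> filtered = new ArrayList<>();
        for (InputSplit split : super.getSplits(job)) {
            if (split instanceof FileSplit) {
                FileSplit fs = (FileSplit) split;
                long splitStart = fs.getStart();
                long splitEnd   = splitStart + fs.getLength();
                if (splitStart < end && splitEnd > start) {
                    filtered.add(split);
                }
            }
        }
        return filtered;
    }
}
```

A driver would then call `job.setInputFormatClass(PartialFileInputFormat.class)` and set the two offsets, e.g. `conf.setLong(PartialFileInputFormat.RANGE_START, 0L)` and `conf.setLong(PartialFileInputFormat.RANGE_END, 40L * 1000 * 1000)` for roughly the first 40 MB. Note that splits at the range boundary are kept whole, so slightly more than the requested range may be processed.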


3 REPLIES

Expert Contributor

Try something like the following to read just the first 40,000,000 bytes (~40 MB) of the file:

hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000

Rising Star

I would recommend splitting up the file and then running your MR job on each of the resulting files.
