Member since
08-26-2016
4
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1505 | 03-29-2017 07:46 AM

03-29-2017 07:46 AM
Hello all, Thanks @ssingla and @Umair Khan for your answers. Finally, I found a solution: derive the `FileInputFormat` class and override the `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file. In this method, I call the superclass to get the splits generated by `FileInputFormat`. From the job configuration, I retrieve the start and end offsets of the part of the HDFS file I want to read. Finally, the start and end of each split obtained from the superclass's `getSplits` method are compared to these offsets, and a split is returned only if it falls within the wanted part of the HDFS file.
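A minimal sketch of this approach (the class name `PartialFileInputFormat` and the configuration keys `part.start` / `part.end` are illustrative, not from the original post; it extends `TextInputFormat`, a `FileInputFormat` subclass, to stay concrete):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Returns only the splits that overlap the byte range [part.start, part.end)
// configured on the job. Hypothetical names, sketched from the post's description.
public class PartialFileInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        // Hypothetical configuration keys holding the wanted byte range.
        long rangeStart = conf.getLong("part.start", 0L);
        long rangeEnd = conf.getLong("part.end", Long.MAX_VALUE);

        List<InputSplit> filtered = new ArrayList<>();
        for (InputSplit split : super.getSplits(job)) {
            FileSplit fs = (FileSplit) split;
            long splitStart = fs.getStart();
            long splitEnd = splitStart + fs.getLength();
            // Keep the split only if it overlaps the wanted byte range.
            if (splitEnd > rangeStart && splitStart < rangeEnd) {
                filtered.add(fs);
            }
        }
        return filtered;
    }
}
```

To read only the first 40 MB, the driver would set `part.start` to `0` and `part.end` to `40L * 1024 * 1024` on the job configuration before submitting.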
03-27-2017 02:08 PM
Hello everybody, I have a big file in HDFS (~20 GB) on which I usually run a MapReduce job; around 170 mappers are created. The input format used is `FileInputFormat`. Now I would like to run the MapReduce job on only a part of the file (for example, the first 40 MB). Is there a simple way to do this? Thanks for your help.
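As a sanity check on the numbers above (assuming the default 128 MB HDFS block size, which the post does not state):

```java
// Rough split-count arithmetic for a ~20 GB file with FileInputFormat,
// assuming one split per 128 MB block (the default block size).
public class SplitCount {
    public static void main(String[] args) {
        long fileSize = 20L * 1024 * 1024 * 1024; // ~20 GB
        long blockSize = 128L * 1024 * 1024;      // assumed 128 MB block size
        long splits = (fileSize + blockSize - 1) / blockSize; // ceiling division
        System.out.println(splits); // prints 160, roughly the ~170 mappers observed
    }
}
```

The first 40 MB would therefore fall entirely inside the first split, so filtering splits by byte range is enough to restrict the job to that part of the file.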
Labels:
- Apache Hadoop