Created 03-27-2017 02:08 PM
Hello everybody,
I have a big file in HDFS (~20 GB) on which I usually execute a MapReduce job. Around 170 mappers are created. The InputFormat used is a `FileInputFormat`.
Now I would like to execute the MapReduce job on only a part of the file (for example, the first 40 MB of the file).
Is there a simple way to perform this?
Thanks for your help.
Created 03-29-2017 07:46 AM
Hello all,
Thanks @ssingla and @Umair Khan for your answers.
Finally, I found a solution: subclass `FileInputFormat` and override the `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file.
In this method, I call the superclass to get the splits generated by `FileInputFormat`. From the job configuration, I read the start and end offsets of the part of the HDFS file I want to process. Finally, the start and end of each split returned by the superclass's `getSplits` method are compared with those offsets, and only the splits that match the wanted part of the HDFS file are returned.
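For anyone who wants to try this, here is a minimal sketch of the comparison step in plain Java, without the Hadoop dependencies, so it can be run on its own. The names `SplitRangeFilter`, `Range` and `select` are made up for illustration, and treating "match" as "overlaps the wanted byte range" is an assumption on my side. In a real job, this logic would sit in the `getSplits(JobContext)` override, applied to the list returned by `super.getSplits(job)`, with the offsets of each split taken from `FileSplit.getStart()` and `FileSplit.getLength()`.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRangeFilter {

    /** Hypothetical stand-in for a Hadoop FileSplit: byte offset and length in the file. */
    public static final class Range {
        public final long start;
        public final long length;

        public Range(long start, long length) {
            this.start = start;
            this.length = length;
        }

        /** Offset of the first byte after this split (exclusive end). */
        public long end() {
            return start + length;
        }
    }

    /**
     * Keeps the splits that overlap the half-open byte range [wantedStart, wantedEnd).
     * Assumption: a split "matches" as soon as it overlaps the wanted range.
     */
    public static List<Range> select(List<Range> splits, long wantedStart, long wantedEnd) {
        List<Range> kept = new ArrayList<>();
        for (Range split : splits) {
            boolean overlaps = split.start < wantedEnd && split.end() > wantedStart;
            if (overlaps) {
                kept.add(split);
            }
        }
        return kept;
    }
}
```

Note that filtering at this level works at split granularity: a split that only partly overlaps the wanted range is still read entirely by its mapper. If the cut needs to be exact to the byte, the record reader would also have to stop at the wanted end offset.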
Created 03-27-2017 09:20 PM
Try something like:
hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000
Created 03-28-2017 12:14 AM
I would recommend splitting the file up and then running your MR job on each of the resulting files.