Execute MapReduce job only on a part of a HDFS file

New Contributor

Hello everybody,

I have a large file in HDFS (~20 GB) on which I usually run a MapReduce job. Around 170 mappers are created. The InputFormat used is a `FileInputFormat`.

Now I would like to run the MapReduce job on only part of the file (for example, the first 40 MB).

Is there a simple way to perform this?

Thanks for your help.

1 ACCEPTED SOLUTION

New Contributor

Hello all,

Thanks @ssingla and @Umair Khan for your answers.

In the end, I found a solution: subclass `FileInputFormat` and override the `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file.

In this method, I call the superclass to get the splits generated by `FileInputFormat`. From the job configuration, I retrieve the start and end offsets of the part of the HDFS file I want to read. Finally, the start and end of each split returned by the superclass's `getSplits` method are compared against those offsets, and a split is returned only if it matches the wanted part of the HDFS file.
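
For readers who want to see the comparison step spelled out, here is a minimal sketch of the filtering logic in plain Java, not the actual code from my job. The `Split` class is a hypothetical stand-in for Hadoop's `FileSplit` (which exposes `getStart()` and `getLength()`), and `filter` models what an overridden `getSplits(JobContext)` could do with the list returned by `super.getSplits(job)`. I assume here that a split is kept when it overlaps the wanted byte range; the range bounds would come from the job configuration under property names of your own choosing.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of filtering input splits by byte range (no Hadoop dependency). */
class SplitRangeFilter {

    /** Hypothetical stand-in for Hadoop's FileSplit: a start offset and a length in bytes. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        /** Exclusive end offset of this split. */
        long end() {
            return start + length;
        }
    }

    /**
     * Keeps only the splits that overlap the byte range [rangeStart, rangeEnd).
     * In a real job, allSplits would be the result of super.getSplits(job),
     * and the range would be read from the job configuration (assumption).
     */
    static List<Split> filter(List<Split> allSplits, long rangeStart, long rangeEnd) {
        List<Split> kept = new ArrayList<>();
        for (Split s : allSplits) {
            // Two half-open ranges overlap when each one starts before the other ends.
            if (s.start < rangeEnd && s.end() > rangeStart) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // Model a 512 MB file cut into four 128 MB splits.
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            splits.add(new Split(i * 128 * mb, 128 * mb));
        }

        // Ask for the first 40 MB only.
        List<Split> kept = filter(splits, 0, 40 * mb);
        System.out.println(kept.size() + " split(s) kept");
    }
}
```

Note that this filters at split granularity: a kept split is still processed in full, so with 128 MB splits a request for the first 40 MB would still read the whole first split. If that matters, one option is to lower the maximum split size (the `mapreduce.input.fileinputformat.split.maxsize` property) so that split boundaries line up more closely with the wanted range.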


3 REPLIES

Expert Contributor

Try something like:

hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000

Cloudera Employee

I would recommend splitting the file up and then running your MR job on each of the resulting files.

