
Execute MapReduce job only on a part of a HDFS file

Solved

New Contributor

Hello everybody,

I have a big file in HDFS (~20Gb) on which I usually execute a MapReduce job. Around 170 mappers are created. The InputFormat used is a FileInputFormat.

Now I would like to execute the MapReduce job only on a part of the file (for example, the first 40Mb of the file).

Is there a simple way to perform this?

Thanks for your help.

1 ACCEPTED SOLUTION


Re: Execute MapReduce job only on a part of a HDFS file

New Contributor

Hello all,

Thanks @ssingla and @Umair Khan for your answers.

Finally, I found a solution: subclass `FileInputFormat` and override its `getSplits` method so that it returns only the splits corresponding to the wanted part of the HDFS file.

In the overridden method, I first call `super.getSplits()` to obtain the splits generated by `FileInputFormat`. From the job configuration, I read the start and end byte offsets of the part of the HDFS file I want to process. Finally, the start and end of each split returned by the super class are compared against those offsets, and only the splits that overlap the wanted part of the file are returned.
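The range-filtering step can be sketched in plain Java without any Hadoop dependencies. Here `Split`, `filterSplits`, `rangeStart`, and `rangeEnd` are illustrative names, not part of the Hadoop API; in a real `FileInputFormat` subclass the same overlap test would be applied to the `InputSplit` list returned by `super.getSplits(job)`:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitFilter {
    // Minimal stand-in for an InputSplit: a byte offset and a length.
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
        long end() { return start + length; }
    }

    // Keep only the splits that overlap the wanted byte range
    // [rangeStart, rangeEnd) of the file.
    static List<Split> filterSplits(List<Split> splits,
                                    long rangeStart, long rangeEnd) {
        List<Split> kept = new ArrayList<>();
        for (Split s : splits) {
            if (s.start < rangeEnd && s.end() > rangeStart) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three 128 MB splits; keep only those touching the first 40 MB.
        long mb = 1024L * 1024L;
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(0, 128 * mb));
        splits.add(new Split(128 * mb, 128 * mb));
        splits.add(new Split(256 * mb, 128 * mb));
        List<Split> kept = filterSplits(splits, 0, 40 * mb);
        System.out.println(kept.size()); // prints 1: only the first split overlaps
    }
}
```

With a default 128 MB block size, a 40 MB range lies entirely inside the first split, so only one mapper would run instead of ~170.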


3 REPLIES 3

Re: Execute MapReduce job only on a part of a HDFS file

Expert Contributor

Try something like:

hadoop fs -cat /path_to_hdfs_file/test.csv | head -c 40000000

This streams only the first 40,000,000 bytes (~40 MB) of the file to stdout; you could pipe that into `hadoop fs -put - /path/to/sample.csv` to materialize a smaller file and run the job against it.

Re: Execute MapReduce job only on a part of a HDFS file

Cloudera Employee

I would recommend splitting up the file and then running your MR job on each of the resulting files.

