MapReduce: FixedRecordReader - Partial record found at the end of split

Rising Star

Hi,

I am not an expert in Java, and I am trying to analyse the FixedInputFormat and FixedRecordReader classes so I can customize them for my project.

I copied both classes from the GitHub link below and am testing them through a Driver and a Mapper class:

https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-...

The input is in a fixed-length format like this:
1234abcvd123mnfvds6722
6543abcad123aewert1234
While running this I get the error: Partial record found at the end of split.
The input split includes the newline characters, so the split length is calculated as 46 instead of 44 and 3 records are counted instead of 2.
How can the newline character be excluded from the input split? I appreciate any help on this.

Thank you

1 ACCEPTED SOLUTION


Please have a look at the code of FixedLengthInputFormat as provided on GitHub.

The basic requirement is that every record has the same length: each record in your file must be exactly "fixedlengthinputformat.record.length" bytes long, and that length includes the delimiter.

1. TextInputFormat was created for reading files whose records are separated by a delimiter.

2. A file can also contain multiple fixed-length records without any delimiter.

Such files save disk space, since a delimiter is redundant: the record length alone determines where one record ends and the next starts.

The file then looks like one big row, which is why FixedLengthInputFormat is used.
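To make the arithmetic concrete, here is a small, self-contained illustration of why the error appears for the sample file in the question; the configured record length of 22 is an assumption inferred from the reported split length of 46.

public class SplitArithmetic {
    public static void main(String[] args) {
        long splitLength = 46;  // two 22-character rows, each followed by '\n'
        int recordLength = 22;  // assumed configured value (excludes the newline)
        System.out.println(splitLength / recordLength); // 2 full records
        System.out.println(splitLength % recordLength); // 2 leftover bytes -> "Partial record found at the end of split"
        // With recordLength = 23 (22 data characters + newline), 46 % 23 == 0
        // and the split divides into exactly two whole records.
    }
}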


Two solutions:

1. Provide fixedlengthinputformat.record.length in the conf object and set it to 23 (22 data characters plus the newline). Remove the delimiter in the map method, as sketched after the snippet below.

Configuration conf = new Configuration(true);
conf.set("fs.default.name", "file:///");
conf.setInt("fixedlengthinputformat.record.length", 23);
job.setInputFormatClass(FixedLengthInputFormat.class);
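A minimal sketch of such a map method follows; the class name and the Text/NullWritable output types are illustrative assumptions, while the LongWritable/BytesWritable input types are what FixedLengthInputFormat produces.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FixedRecordMapper
        extends Mapper<LongWritable, BytesWritable, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Each value is one 23-byte record: 22 data characters plus the trailing '\n'.
        String record = new String(value.copyBytes(), StandardCharsets.UTF_8);
        // Strip the delimiter that is counted as part of the fixed record length.
        if (record.endsWith("\n")) {
            record = record.substring(0, record.length() - 1);
        }
        context.write(new Text(record), NullWritable.get());
    }
}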

2. Use TextInputFormat, but note that it does not check that the records are all the same length; you will have to do that check inside your map method, as in the sketch below.
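For the TextInputFormat route, a possible map-side length check could look like this; the expected length of 22 and the counter names are assumptions, not part of the original answer.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LengthCheckingMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final int EXPECTED_LENGTH = 22; // 22 data characters per record

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // TextInputFormat already strips the line terminator, so only the length check remains.
        String line = value.toString();
        if (line.length() != EXPECTED_LENGTH) {
            // Track malformed records instead of failing the whole job.
            context.getCounter("records", "bad_length").increment(1);
            return;
        }
        context.write(value, NullWritable.get());
    }
}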


2 REPLIES


Rising Star

Thanks for the suggestion. I am opting for the second solution, as the data is not one big continuous row and the first solution did not work.