Created 10-31-2017 12:14 PM
Hi,
I am not an expert in Java and am trying to analyse FixedInputFormat and FixedRecordReader so that I can customize them for my project.
I copied both classes from the GitHub link below and am testing them with a driver and a mapper class:
https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-...
The input is in a fixed-length format like this:
1234abcvd123mnfvds6722
6543abcad123aewert1234
When I run this I get the error: Partial record found at the end of split.
The input split has counted the newline characters, so the split length is calculated as 46 instead of 44 and 3 records are found instead of 2.
How can the newline character be excluded from the input split? I appreciate any help on this.
Created 10-31-2017 04:32 PM
Please have a look at the code of FixedLengthInputFormat as provided on GitHub.
The basic requirement is that every record has the same length. That means each record in your file must be exactly "fixedlengthinputformat.record.length" bytes long, and that length includes the delimiter too.
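To make the arithmetic concrete, here is a small sketch in plain Java (no Hadoop dependency) that computes how many complete records fit into the 46-byte sample and how many bytes are left over. The class and method names are made up for illustration, and it assumes the file uses one-byte "\n" line endings and ends with a newline.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: SplitMath and its methods are hypothetical names, not Hadoop API.
final class SplitMath {
    /** Number of complete records that fit into splitLength bytes. */
    static long fullRecords(long splitLength, int recordLength) {
        return splitLength / recordLength;
    }

    /** Bytes left after the last complete record; non-zero means a partial record. */
    static long leftoverBytes(long splitLength, int recordLength) {
        return splitLength % recordLength;
    }

    public static void main(String[] args) {
        // Two 22-character lines, each followed by "\n" (assumed LF line endings).
        String file = "1234abcvd123mnfvds6722\n6543abcad123aewert1234\n";
        long splitLength = file.getBytes(StandardCharsets.US_ASCII).length;
        for (int recordLength : new int[] {22, 23}) {
            System.out.println("record.length=" + recordLength
                + " full=" + fullRecords(splitLength, recordLength)
                + " leftover=" + leftoverBytes(splitLength, recordLength));
        }
    }
}
```

With a record length of 22 the two newline bytes are left over as a partial third record, while 23 divides the 46 bytes evenly. If the file had "\r\n" line endings or no trailing newline, the numbers would differ.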
1. Please understand that TextInputFormat was created for reading files whose records are delimited.
2. A file can also contain multiple fixed-length records without any delimiter.
This saves disk space, because a delimiter is redundant in such files.
Only the record length determines where one record ends and the next one starts.
Such a file looks like one big row, which is why FixedLengthInputFormat is used for it.
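As a rough illustration of what "only the record length determines the boundaries" means, the sketch below cuts a delimiter-free byte array into fixed-length records. It is plain Java and not the Hadoop record reader; FixedSlicer and slice are hypothetical names, and the sample data is the two input rows joined without a newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified stand-in, not the actual Hadoop record reader.
final class FixedSlicer {
    /** Cuts data into records of exactly recordLength bytes each. */
    static List<String> slice(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (data.length % recordLength != 0) {
            // Same situation as the "partial record" error: the bytes do not divide evenly.
            throw new IllegalArgumentException("Partial record at end of input: "
                + (data.length % recordLength) + " leftover byte(s)");
        }
        List<String> records = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            records.add(new String(data, offset, recordLength, StandardCharsets.US_ASCII));
        }
        return records;
    }

    public static void main(String[] args) {
        // The two sample rows stored back to back, with no delimiter between them.
        byte[] data = "1234abcvd123mnfvds67226543abcad123aewert1234"
            .getBytes(StandardCharsets.US_ASCII);
        for (String record : slice(data, 22)) {
            System.out.println(record);
        }
    }
}
```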
Two solutions:
1. Set fixedlengthinputformat.record.length to 23 in the conf object (22 data characters plus the one-byte newline), then remove the delimiter in the map method.
<code>Configuration conf = new Configuration(true);
conf.set("fs.default.name", "file:///");
// 22 data characters + 1 newline byte per record
conf.setInt("fixedlengthinputformat.record.length", 23);
// Create the job after the settings above so that it picks them up
Job job = Job.getInstance(conf);
job.setInputFormatClass(FixedLengthInputFormat.class);</code>
2. Use TextInputFormat. It does not check that all records have the same length, so you will have to do that check inside your map method.
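The per-record work in either solution could look roughly like the helpers below. This is only a sketch in plain Java: RecordChecks and its methods are hypothetical names, the expected length of 22 comes from the sample rows, and in a real mapper you would presumably call them on the string form of the incoming value.

```java
// Illustration only: helper names are hypothetical, not part of Hadoop.
final class RecordChecks {
    /** Data length of one sample row, without the line terminator. */
    static final int DATA_LENGTH = 22;

    /** Solution 1: drop a trailing "\n" or "\r\n" from a fixed-length record. */
    static String stripDelimiter(String record) {
        int end = record.length();
        while (end > 0
            && (record.charAt(end - 1) == '\n' || record.charAt(end - 1) == '\r')) {
            end--;
        }
        return record.substring(0, end);
    }

    /** Solution 2: check that a line has the expected number of characters. */
    static boolean hasExpectedLength(String line, int expected) {
        return line.length() == expected;
    }

    public static void main(String[] args) {
        String record = "1234abcvd123mnfvds6722\n";
        String data = stripDelimiter(record);
        System.out.println(data + " valid=" + hasExpectedLength(data, DATA_LENGTH));
    }
}
```

With solution 2 the line passed to the map method already comes without the newline, so only the length check is needed there. What to do with a line of the wrong length (skip it, count it, or fail the job) is up to you.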
Created 10-31-2017 06:37 PM
Thanks for the suggestion. I am opting for the 2nd solution, as the data is not one big continuous row and the 1st solution did not work.