How to create a custom Hadoop InputFormat/FileInputFormat


Does anyone know how to do this, or have a tutorial?

1 ACCEPTED SOLUTION


Please start by getting a basic understanding of InputFormats and RecordReaders:

http://bytepadding.com/big-data/map-reduce/how-records-are-handled-map-reduce/

Example of a custom InputFormat:

http://bytepadding.com/big-data/spark/combineparquetfileinputformat/


A few pointers:
1. Start with a basic understanding of splits, InputFormats, RecordReaders, file formats, and compression.
2. Go through the code of TextInputFormat: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...
3. FileInputFormat is the abstract base class for all file-based input formats. Go through its basic functionality: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...
4. Decide what the logical record for your InputFormat is and what the splitting strategy is. Depending on this, extend FileInputFormat and override/implement the getSplits() and getRecordReader() methods.

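Important FileInputFormat methods:
getSplits(): defines the splits, i.e. for each split, the offset in the file where it starts and where it ends. Each task reads one split.
getRecordReader(): defines how the bytes of the split being read are converted into records. (In the newer org.apache.hadoop.mapreduce API this method is called createRecordReader().)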
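
To make the split/record distinction concrete, here is a minimal plain-Java sketch of the idea. It is not the Hadoop API: SplitSketch, Split, and nextLineStart are hypothetical names made up for illustration, and the "file" is just a byte array. It assumes newline-delimited records and follows the convention used by line-oriented readers: splits are cut at arbitrary byte offsets, a reader whose split does not start at offset 0 skips the first (possibly partial) line, and every reader finishes the last line it started even if that means reading past the end of its split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of split computation and line-record reading (no Hadoop dependency). */
final class SplitSketch {

    /** Hypothetical stand-in for a file split: the byte range [start, start + length). */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    /** Models getSplits(): cut the file into fixed-size byte ranges, ignoring record boundaries. */
    static List<Split> getSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    /** Models a record reader: turn the bytes of one split into whole-line records. */
    static List<String> readRecords(byte[] file, Split split) {
        List<String> records = new ArrayList<>();
        int pos = (int) split.start;
        long end = split.start + split.length;
        if (split.start != 0) {
            // Skip the first (possibly partial) line; the previous split's reader emits it.
            pos = nextLineStart(file, pos);
        }
        // Emit every line that starts at or before `end`, reading past the split end if needed.
        while (pos < file.length && pos <= end) {
            int next = nextLineStart(file, pos);
            int lineEnd = (file[next - 1] == '\n') ? next - 1 : next; // drop the trailing newline
            records.add(new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    /** Returns the offset just past the next '\n' at or after `from`, or file.length if none. */
    private static int nextLineStart(byte[] file, int from) {
        int i = from;
        while (i < file.length && file[i] != '\n') i++;
        return Math.min(i + 1, file.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (Split s : getSplits(data.length, 8)) {
            System.out.println("split@" + s.start + " -> " + readRecords(data, s));
        }
    }
}
```

Even though the 8-byte splits cut through the middle of "beta" and "gamma", each line is emitted exactly once, by the split in which it starts. In a real job the same two decisions (where splits begin and end, and how bytes become records) go into your FileInputFormat subclass and the RecordReader it returns.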



2 REPLIES



Thanks @kgautam, this is really helpful.