Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

How to create a hadoop custom inputformat/fileinputformat


Does anyone know of a tutorial, or have one to share?

1 ACCEPTED SOLUTION


Please go through the basics of InputFormats and RecordReaders first:

http://bytepadding.com/big-data/map-reduce/how-records-are-handled-map-reduce/

An example of a custom InputFormat:

http://bytepadding.com/big-data/spark/combineparquetfileinputformat/


A few pointers:

1. Start with a basic understanding of splits, InputFormats, RecordReaders, file formats, and compression.

2. Go through the code of TextInputFormat: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...

3. FileInputFormat is the abstract base class for all file-based input formats; go through its basic functionality:

http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...

4. Decide what the logical record for your InputFormat is and what your splitting strategy is. Based on that, extend FileInputFormat and override/implement the getSplits() and getRecordReader() methods.

FileInputFormat's important methods:
getSplits(): each task reads one split; this method decides the start and end byte offsets of each split within the file.
getRecordReader(): returns the reader that converts the bytes of the split being read into records.
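To make the two decisions above concrete, here is a minimal stand-alone sketch (plain Java, no Hadoop dependency; all names are illustrative, not Hadoop's) of what getSplits() and a line-oriented record reader do. It mimics TextInputFormat's trick for records that straddle a split boundary: a reader whose split does not start at byte 0 skips the partial first line, and every reader reads past its split's end to finish its last line, so each record is read exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // A split is just a (start, length) byte range, like Hadoop's FileSplit.
    record Split(long start, long length) {}

    // getSplits(): cut a file of fileLen bytes into ranges of at most splitSize bytes.
    static List<Split> getSplits(long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLen - start)));
        }
        return splits;
    }

    // getRecordReader(): turn the bytes of one split into newline-delimited records.
    static List<String> readRecords(byte[] file, Split split) {
        String text = new String(file, StandardCharsets.UTF_8);
        long pos = split.start();
        if (pos != 0) {
            // Skip the partial first line: scan from pos-1 so that a split
            // starting exactly on a line boundary skips nothing.
            int nl = text.indexOf('\n', (int) pos - 1);
            pos = (nl == -1) ? file.length : nl + 1;
        }
        List<String> records = new ArrayList<>();
        long end = split.start() + split.length();
        while (pos < end && pos < file.length) {
            // Read past 'end' if needed to finish the last record of this split.
            int nl = text.indexOf('\n', (int) pos);
            if (nl == -1) nl = file.length;
            records.add(text.substring((int) pos, nl));
            pos = nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        List<Split> splits = getSplits(file.length, 8); // 20 bytes -> splits of 8, 8, 4
        List<String> all = new ArrayList<>();
        for (Split s : splits) all.addAll(readRecords(file, s));
        System.out.println(all); // every line exactly once, though splits cut mid-line
    }
}
```

Note this is only the logic; in a real InputFormat you would extend FileInputFormat and return these pieces from getSplits() and getRecordReader() (createRecordReader() in the newer org.apache.hadoop.mapreduce API).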





Thanks @kgautam, this is really helpful.