Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

How to create a hadoop custom inputformat/fileinputformat


Does anyone know of a tutorial, or have one to share?

1 ACCEPTED SOLUTION


Please go through the basics of InputFormats and RecordReaders first:

http://bytepadding.com/big-data/map-reduce/how-records-are-handled-map-reduce/

An example of a custom InputFormat:

http://bytepadding.com/big-data/spark/combineparquetfileinputformat/


A few pointers:

1. Start with a basic understanding of splits, InputFormats, RecordReaders, file formats, and compression.

2. Go through the code of TextInputFormat: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...

3. FileInputFormat is the abstract base class for all file-based input formats; go through its basic functionality:

http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...

4. Decide what the logical record for your InputFormat is and what your splitting strategy is. Based on that, extend FileInputFormat and override/implement the getSplits() and getRecordReader() methods.

FileInputFormat's important methods:
getSplits(): each task reads one split; this method decides the start and end byte offsets of each split within the file.
getRecordReader(): returns the reader that converts the bytes of the split being read into records.
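To make the two decisions above concrete, here is a minimal stand-alone sketch (plain Java, no Hadoop dependency; all names are illustrative, not Hadoop's) of what getSplits() and a line-oriented record reader do. It mimics TextInputFormat's trick for records that straddle a split boundary: a reader whose split does not start at byte 0 skips the partial first line, and every reader reads past its split's end to finish its last line, so each record is read exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // A split is just a (start, length) byte range, like Hadoop's FileSplit.
    record Split(long start, long length) {}

    // getSplits(): cut a file of fileLen bytes into ranges of at most splitSize bytes.
    static List<Split> getSplits(long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLen; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLen - start)));
        }
        return splits;
    }

    // getRecordReader(): turn the bytes of one split into newline-delimited records.
    static List<String> readRecords(byte[] file, Split split) {
        String text = new String(file, StandardCharsets.UTF_8);
        long pos = split.start();
        if (pos != 0) {
            // Skip the partial first line: scan from pos-1 so that a split
            // starting exactly on a line boundary skips nothing.
            int nl = text.indexOf('\n', (int) pos - 1);
            pos = (nl == -1) ? file.length : nl + 1;
        }
        List<String> records = new ArrayList<>();
        long end = split.start() + split.length();
        while (pos < end && pos < file.length) {
            // Read past 'end' if needed to finish the last record of this split.
            int nl = text.indexOf('\n', (int) pos);
            if (nl == -1) nl = file.length;
            records.add(text.substring((int) pos, nl));
            pos = nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        List<Split> splits = getSplits(file.length, 8); // 20 bytes -> splits of 8, 8, 4
        List<String> all = new ArrayList<>();
        for (Split s : splits) all.addAll(readRecords(file, s));
        System.out.println(all); // every line exactly once, though splits cut mid-line
    }
}
```

Note this is only the logic; in a real InputFormat you would extend FileInputFormat and return these pieces from getSplits() and getRecordReader() (createRecordReader() in the newer org.apache.hadoop.mapreduce API).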





Thanks @kgautam, this is really helpful.