Created 03-07-2018 07:50 AM
Does anyone know how to do this, or have a tutorial?
Created 03-07-2018 10:43 AM
Please start with a basic understanding of InputFormats and RecordReaders:
http://bytepadding.com/big-data/map-reduce/how-records-are-handled-map-reduce/
Example of a custom InputFormat:
http://bytepadding.com/big-data/spark/combineparquetfileinputformat/
A few pointers:
1. Start with a basic understanding of splits, InputFormats, RecordReaders, file formats, and compression.
2. Go through the code of TextInputFormat: http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...
3. FileInputFormat is the abstract base class for file-based input formats. Go through its basic functionality:
http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.1.0-pre7/org/apache/had...
4. Decide what the logical record for your InputFormat is and what the splitting strategy is. Depending on this, extend
FileInputFormat and override/implement the getSplits() and getRecordReader() methods (createRecordReader() in the newer org.apache.hadoop.mapreduce API).
Important FileInputFormat methods:
getSplits(): each task reads one split; this method decides the start and end offsets within the file for each split.
getRecordReader(): defines how the bytes of the split being read are converted into records.
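To make the two responsibilities above concrete, here is a minimal plain-Java sketch that does not use the Hadoop API at all. The class SplitSketch and its methods getSplits(), readRecords() and nextLineStart() are hypothetical names chosen for illustration; they only mimic the idea of cutting input into byte ranges and then turning each range into line records. It assumes newline-delimited records and follows the convention used by line-oriented readers: a reader whose split does not start at offset 0 discards its first (possibly partial) line, and every reader keeps going past its split end to finish the last line it started.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /** A byte range [start, start + length) of the input, analogous to a file split. */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    /** Analogue of getSplits(): cut the input into fixed-size byte ranges. */
    static List<Split> getSplits(long totalLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long pos = 0; pos < totalLength; pos += splitSize) {
            splits.add(new Split(pos, Math.min(splitSize, totalLength - pos)));
        }
        return splits;
    }

    /** Analogue of a record reader: convert the bytes of one split into line records. */
    static List<String> readRecords(byte[] data, Split split) {
        int pos = (int) split.start;
        int end = (int) (split.start + split.length);
        // A split that does not start at offset 0 discards its first line;
        // the reader of the previous split is responsible for that line.
        if (pos != 0) {
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // Read every line that starts at or before the split end, even if it
        // runs past the end, so that no line is lost or read twice.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    /** Returns the offset just after the next newline, or data.length if there is none. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (Split s : getSplits(data.length, 8)) {
            System.out.println("split@" + s.start + " -> " + readRecords(data, s));
        }
    }
}
```

With an 8-byte split size the split boundaries fall in the middle of lines, yet each line is emitted exactly once across all splits. A real custom InputFormat would put the same two decisions (how to cut splits, how to find record boundaries inside a split) into its FileInputFormat subclass and its RecordReader.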
Created 03-08-2018 01:04 AM
Thanks @kgautam, this is really helpful.