What are the most common InputFormats in Hadoop?
Labels: Apache Hadoop
Created 02-02-2018 10:28 AM
Created 02-03-2018 05:15 AM
In Hadoop, input files store the data for a MapReduce job, and they typically reside in HDFS. The InputFormat defines how these input files are split and read: it creates the InputSplits and supplies the RecordReader that turns each split into key-value records for the mapper.
The most common InputFormats are:
FileInputFormat: the base class for all file-based InputFormats. It specifies the input directory where the data files are located, picks up all the files in that directory, and divides them into one or more InputSplits.
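To make the "divides these files into InputSplits" step more concrete, here is a rough plain-Java sketch. It does not use the Hadoop classes. The `computeSplitSize` formula mirrors the one in FileInputFormat, but the `splits` helper is a simplified assumption of mine: it ignores block locations, non-splittable (compressed) files, and the small slack Hadoop allows on the last split.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Split size is the block size, clamped between the configured min and max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: returns {start, length} pairs that cover the whole file.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            result.add(new long[] {start, length});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: 128 MB blocks, no effective min/max limits, 300 MB file.
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        for (long[] s : splits(300 * mb, splitSize)) {
            System.out.println("start=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With these assumed numbers the 300 MB file ends up as three splits of 128 MB, 128 MB and 44 MB, so the job would run three map tasks for that file.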
TextInputFormat: the default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing.
- Key: the byte offset of the beginning of the line within the file.
- Value: the contents of the line, excluding line terminators.
Example file content: is john may which katty
- Key: 0
- Value: is john may which katty
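The key/value layout above can be imitated in a few lines of plain Java. This is only an illustration of the record shape, not the Hadoop implementation: the class name `TextRecordsSketch` and the second sample line are made up, and it assumes UTF-8 content with '\n' line terminators.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class TextRecordsSketch {
    // Maps the byte offset of each line start to the line text (terminator excluded).
    static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance by the line's byte length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "is john may which katty\nsecond line\n";
        toRecords(content).forEach((key, value) -> System.out.println(key + " => " + value));
    }
}
```

The first line gets key 0, as in the example above. The second line gets key 24, because the first line is 23 bytes long plus one byte for the newline.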
KeyValueTextInputFormat: similar to TextInputFormat in that it also treats each line of input as a separate record. The main difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat splits the line into a key and a value at the first tab character ('\t').
- Key: everything up to the tab character.
- Value: the remaining part of the line after the tab character.
Example file content ("->" stands for the tab character): is -> john may which katty
- Key: is
- Value: john may which katty
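The tab split can be sketched in plain Java as well. Again this is not the Hadoop code, just an illustration; the class and method names are hypothetical. The no-tab case follows what I understand Hadoop to do: the whole line becomes the key and the value is empty.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class KeyValueLineSketch {
    // Splits one input line at the first tab into (key, value).
    static Map.Entry<String, String> splitLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No separator found: the entire line is the key, the value is empty.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record = splitLine("is\tjohn may which katty");
        System.out.println("Key: " + record.getKey());
        System.out.println("Value: " + record.getValue());
    }
}
```

For the sample line this prints "Key: is" and "Value: john may which katty". Any later tabs in the line stay inside the value, since only the first one is used as the separator.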
SequenceFileInputFormat: the InputFormat that reads sequence files (Hadoop's binary key-value file format). Both the key and the value types are user-defined.
Follow the link to learn more about Input Format in Hadoop
Created 02-02-2018 07:27 PM
The most common InputFormats are:
1. FileInputFormat (base class for all file-based InputFormats)
2. TextInputFormat
3. KeyValueTextInputFormat
4. SequenceFileInputFormat
5. BinaryInputFormat
