What is the difference between an HDFS block and an InputSplit?
In Hadoop, HDFS splits each file into blocks and distributes them across the nodes in a cluster.
The default HDFS block size is 128 MB, which can be configured as required (the dfs.blocksize property). All blocks of a file are the same size except the last one, which can be the same size or smaller. With the default setting, a file is therefore cut into 128 MB blocks before it is stored in the Hadoop file system.
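The block arithmetic described above can be illustrated with a small sketch. This is not Hadoop code: the function name block_sizes and the 300 MB file are made up for illustration, and only the 128 MB default comes from the text.

```python
from typing import List

MB = 1024 * 1024


def block_sizes(file_size: int, block_size: int = 128 * MB) -> List[int]:
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # Only the last block may be smaller than the configured block size.
        sizes.append(remainder)
    return sizes


# A hypothetical 300 MB file: two full blocks plus a 44 MB tail block.
print([s // MB for s in block_sizes(300 * MB)])  # [128, 128, 44]
```

Note that the last block takes up only 44 MB here; a file whose size is an exact multiple of the block size has no shorter tail block.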
A block is the physical representation of data. It is the smallest unit of data that can be read or written.
The data to be processed by a single mapper is represented by an InputSplit. Initially, the data for a MapReduce job resides in input files in HDFS. The InputFormat defines how these input files are split and read, and it is responsible for creating the InputSplits.
By default, the split size is approximately equal to the block size. The split size is user-configurable: the user can control it in the MapReduce program based on the size of the data.
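For file-based input, the split size follows the rule used by Hadoop's FileInputFormat: max(minSize, min(maxSize, blockSize)). The Python sketch below only mirrors that rule; the function name compute_split_size and the sample values are illustrative. The defaults are assumed to match Hadoop's (minimum 1 byte, maximum Long.MAX_VALUE), and in a real job they are set through mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # stands in for Java's Long.MAX_VALUE


def compute_split_size(block_size: int, min_size: int = 1, max_size: int = LONG_MAX) -> int:
    """Mirror of the rule max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


# With the defaults, the split size equals the block size.
print(compute_split_size(128 * MB) // MB)                     # 128
# Lowering the maximum below the block size gives smaller splits (more mappers).
print(compute_split_size(128 * MB, max_size=64 * MB) // MB)   # 64
# Raising the minimum above the block size gives larger splits (fewer mappers).
print(compute_split_size(128 * MB, min_size=256 * MB) // MB)  # 256
```

So the split size only differs from the block size when the user moves the minimum above it or the maximum below it.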
An InputSplit is the logical representation of the data present in the blocks. It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn’t contain the actual data, only a reference to it.
An InputSplit is only a logical chunk of data, i.e. it holds just the information about the blocks' addresses or locations.
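To make the "reference, not data" idea concrete, here is a simplified sketch of what a file-based split carries: a path, a start offset, a length and the hosts that hold the underlying block. The class SplitRef, the function make_splits, the file path and the node names are all hypothetical, and the loop ignores details of the real implementation such as the slack allowed on the last split.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

MB = 1024 * 1024


@dataclass(frozen=True)
class SplitRef:
    """A logical split: it describes where the data lives, it holds no file bytes."""
    path: str
    start: int               # byte offset where the split begins
    length: int              # number of bytes covered by the split
    hosts: Tuple[str, ...]   # nodes storing the block that contains `start`


def make_splits(path: str, file_size: int, split_size: int, block_size: int,
                block_hosts: Sequence[Sequence[str]]) -> List[SplitRef]:
    """Cut a file into logical splits without reading any of its contents."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        # Assumption: locality hints come from the block holding the first byte.
        hosts = tuple(block_hosts[offset // block_size])
        splits.append(SplitRef(path, offset, length, hosts))
        offset += length
    return splits


# Hypothetical 300 MB file stored as three blocks on made-up nodes.
hosts_per_block = [["node1", "node2"], ["node2", "node3"], ["node1", "node3"]]
for split in make_splits("/data/logs.txt", 300 * MB, 128 * MB, 128 * MB, hosts_per_block):
    print(split.path, split.start // MB, split.length // MB, split.hosts)
```

Each SplitRef is a few fields of metadata no matter how large the range it points to, which is why a split is called a logical rather than a physical unit.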