Created 12-13-2017 12:24 PM
We have an input split parameter and a block size in Hadoop. Why are these two parameters required, and what is each used for?
Block size :- dfs.block.size (set when the data is stored).
Input split size :- taken into account while the job is running.
Why do we need two parameters in a Hadoop cluster?
Created 12-15-2017 04:35 AM
Block Size:
The physical unit in which the data is stored. The default HDFS block size is 128 MB, and it can be configured to suit your requirements.
All blocks of a file are the same size except the last block, which can be the same size or smaller.
Files are split into 128 MB blocks and then stored in the Hadoop file system.
In HDFS, each file is divided into blocks based on the configured block size, and Hadoop distributes those blocks across the cluster.
The main aim of splitting the file and storing the blocks across the cluster is to get more parallelism. The replication factor provides fault tolerance, and it also helps run your map tasks close to the data, avoiding extra load on the network.
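Because the block size is a storage-time property, it is normally set in the client or cluster configuration before the file is written. Below is a minimal sketch of writing a file with a non-default block size; the 256 MB value and the /tmp path are only illustrative assumptions, not from the original post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is the current key; dfs.block.size is the older alias.
        // Files created through this client will use this block size by default.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB, illustrative

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-demo.txt"))) {
            out.writeUTF("block size is decided when the data is stored");
        }
    }
}
```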
Input Split:
A logical representation of the data; a split can correspond to one block or be larger or smaller than a block.
It is used during data processing in a MapReduce program or other processing techniques. An InputSplit doesn't contain the actual data, only a reference to the data.
During MapReduce execution, Hadoop scans the blocks, creates InputSplits, and assigns each InputSplit to an individual mapper for processing. The split acts as a broker between the block and the mapper.
For example, a 1.2 GB file is divided into 10 blocks: nine full 128 MB blocks plus a smaller final block of roughly 77 MB.
InputFormat.getSplits() is responsible for generating the input splits, and each split is used as the input for one mapper. By default this creates one input split for each HDFS block.
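To make the block/split relationship concrete, here is a small sketch that mirrors the way FileInputFormat derives the split size (max of the minimum size and the smaller of the maximum size and the block size) and applies it to the 1.2 GB example above. The concrete numbers are illustrative, and the last-split handling is simplified.

```java
public class SplitSizeSketch {
    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;          // 128 MB HDFS block
        long minSize = 1L;                            // mapreduce.input.fileinputformat.split.minsize
        long maxSize = Long.MAX_VALUE;                // mapreduce.input.fileinputformat.split.maxsize
        long fileSize = (long) (1.2 * 1024 * 1024 * 1024); // ~1.2 GB file from the example

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        // With the defaults, split size equals the block size, so the file
        // yields 10 splits and therefore 10 mappers.
        long numSplits = (long) Math.ceil((double) fileSize / splitSize);
        System.out.println("split size = " + splitSize + " bytes, splits = " + numSplits);
    }
}
```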
Created 12-17-2017 08:04 AM
Your comments are appreciated, thank you.
As you mentioned, and in addition, we can change the input split size according to our requirements by using the parameters below (a small sketch follows).
mapred.max.split.size :- sets the maximum input split size; pass it while running the job to get splits smaller than the block size (mapred.min.split.size can be raised if larger splits are wanted).
dfs.block.size :- the global HDFS block size parameter, applied while storing the data in the cluster.
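A minimal sketch of overriding the split size at job-submission time is shown below. It uses the new-API key names (mapred.max.split.size / mapred.min.split.size are the older aliases); the 64 MB value and the job name are illustrative assumptions, and mapper/reducer setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A max size below the block size produces more, smaller splits (more mappers);
        // a min size above the block size produces fewer, larger splits.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);

        Job job = Job.getInstance(conf, "split-tuning-demo"); // illustrative job name
        // ... set mapper, reducer, input/output paths, then job.waitForCompletion(true)
    }
}
```

The same properties can also be passed on the command line with -D (for example, -D mapreduce.input.fileinputformat.split.maxsize=67108864) when the driver uses ToolRunner, so no code change is needed to retune the splits per run.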