
Difference between Hadoop block size and input splits, and why are there two parameters?

Expert Contributor

We have an input split parameter and a block size parameter in Hadoop. Why are these two parameters required, and what is each used for?

Block size: dfs.block.size

Input split size: taken while the job is running.

Why do we need two parameters in a Hadoop cluster?
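
As a rough sketch of where each setting lives (property names as in current Hadoop releases; dfs.block.size is the older spelling of dfs.blocksize, and the values below are only illustrative):

import org.apache.hadoop.conf.Configuration;

public class TwoSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS layer: block size applied when a file is written
        // (normally set cluster-wide in hdfs-site.xml; set here only for illustration).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // MapReduce layer: split-size hints that FileInputFormat reads at job run time.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", Long.MAX_VALUE);

        System.out.println("block size = " + conf.get("dfs.blocksize"));
        System.out.println("min split  = " + conf.get("mapreduce.input.fileinputformat.split.minsize"));
    }
}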

1 ACCEPTED SOLUTION

Master Guru
@zkfs

Block Size:

The physical unit in which the data is stored. The default HDFS block size is 128 MB, and it can be configured as required.

All blocks of a file are the same size except the last one, which can be the same size or smaller.

Files are split into 128 MB blocks and then stored in the Hadoop file system (HDFS).

In HDFS, each file is divided into blocks according to the configured block size, and Hadoop distributes those blocks across the cluster.

The main aim of splitting a file and storing the blocks across the cluster is parallelism; the replication factor provides fault tolerance, and it also helps run your map tasks close to the data, avoiding extra load on the network.
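
If it helps, here is a small, hedged Java sketch of overriding the block size per file at write time (the path and values are made up; FileSystem.create has an overload that takes an explicit block size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write this particular file with 256 MB blocks instead of the cluster default.
        long blockSize = 256L * 1024 * 1024;
        short replication = 3;
        try (FSDataOutputStream out =
                 fs.create(new Path("/tmp/example.txt"), true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
    }
}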

Input Split:

A logical representation of the data; a split can correspond to one block, or be larger or smaller than a block.

It is used during data processing in a MapReduce program or other processing frameworks. An InputSplit does not contain the actual data, only a reference to the data.

During MapReduce execution, Hadoop scans the blocks and creates InputSplits, and each InputSplit is assigned to an individual mapper for processing. The split acts as a broker between the block and the mapper.
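
To make "a reference to the data" concrete, here is a hedged sketch (the mapper's key/value types are illustrative): with FileInputFormat-based inputs, the split a mapper receives is a FileSplit carrying only the file path, a start offset, and a length, not the bytes themselves.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The split is only metadata: which file, where to start, and how many bytes to read.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file   = " + split.getPath());
        System.out.println("start  = " + split.getStart());
        System.out.println("length = " + split.getLength());
    }
}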

Let's take a 1.2 GB file: with a 128 MB block size it is divided into 10 blocks (nine full 128 MB blocks plus one smaller final block).

InputFormat.getSplits() is responsible for generating the input splits, and each split is then used as the input for one mapper. By default, FileInputFormat creates one input split per HDFS block.
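
By way of illustration, the default rule can be sketched as splitSize = max(minSize, min(maxSize, blockSize)); the small stand-alone class below just reproduces that arithmetic:

public class SplitSizeRule {
    // Default FileInputFormat split-size rule (reproduced here for illustration only).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): one split per 128 MB block.
        System.out.println(computeSplitSize(128 * mb, 1L, Long.MAX_VALUE) / mb);        // 128
        // Raising the minimum split size to 256 MB makes each split cover two blocks.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb);  // 256
    }
}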

  1. If no split size is specified and the start and end positions of the records fall within the same block, the HDFS block size becomes the split size; in our example 10 mappers are initialized to load the file, and each mapper loads one block.
  2. If the start and end positions of a record are not in the same block (the record spans a block boundary), this is exactly the problem input splits solve: the input split supplies the start and end positions (offsets) of the records, so each split delivers complete records as key/value pairs to its mapper, and the mapper then reads the data according to those start and end offsets.
  3. If the input is made non-splittable (for example, the input format reports the file as not splittable), the whole file forms one input split and is processed by a single mapper, which takes much longer when the file is big.
  4. If your resources are limited and you want to limit the number of map tasks, you can set the split size to 256 MB; blocks are then logically grouped into 256 MB splits and only 5 mappers run, each processing about 256 MB (see the sketch after this list).
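
For case 4, a hedged sketch of the job-side change (the class name is made up; only the setMinInputSplitSize call matters):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FiveMapperExample {
    static void configureSplits(Job job) {
        // splitSize = max(256 MB, min(Long.MAX_VALUE, 128 MB)) = 256 MB,
        // so the ~1.2 GB file above yields 5 splits and therefore 5 map tasks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    }
}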


2 REPLIES


Expert Contributor

Your comments are appreciated, thank you.

As you mentioned, and in addition, we can change the input split size to suit our requirements using the parameters below.

mapred.min.split.size / mapred.max.split.size :- job-time parameters that raise or cap the input split size; to make splits larger than a block, raise the minimum (the current names are mapreduce.input.fileinputformat.split.minsize and .maxsize).
dfs.block.size                                :- the global HDFS block size parameter (now dfs.blocksize), applied when data is stored in the cluster.
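
A hedged driver sketch (class and path names are made up): because it goes through ToolRunner, the split-size property can also be overridden at submit time without recompiling, e.g. hadoop jar myjob.jar SplitSizeDriver -D mapreduce.input.fileinputformat.split.minsize=268435456 /input /output.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SplitSizeDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides parsed by ToolRunner.
        Job job = Job.getInstance(getConf(), "split-size-demo");
        job.setJarByClass(SplitSizeDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Default (pass-through) mapper and reducer are used here for brevity.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new SplitSizeDriver(), args));
    }
}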