<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Difference between Hadoop block size and input splits in Hadoop, and why are there two parameters? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224161#M186025</link>
    <description>&lt;P&gt;Your comments are appreciated, thank you.&lt;/P&gt;&lt;P&gt;As you mentioned, and in addition, we can change the input split size as required using the parameters below.&lt;/P&gt;&lt;PRE&gt;mapred.max.split.size :- Caps the input split size for a job; lower it to get more, smaller splits (to grow splits beyond the block size, raise mapred.min.split.size instead).
dfs.block.size        :- The global HDFS block size parameter, used while storing data in the cluster.
&lt;/PRE&gt;</description>
    <pubDate>Sun, 17 Dec 2017 16:04:10 GMT</pubDate>
    <dc:creator>shivkumar82015</dc:creator>
    <dc:date>2017-12-17T16:04:10Z</dc:date>
    <item>
      <title>Difference between Hadoop block size and input splits in Hadoop, and why are there two parameters?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224159#M186023</link>
      <description>&lt;P&gt;We have an input split parameter and a block size parameter in Hadoop. Why are these two parameters required, and what is each used for?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Block size&lt;/STRONG&gt;: &lt;STRONG&gt;dfs.block.size&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Input split size&lt;/STRONG&gt;: determined while the job runs.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Why do we need two parameters in a Hadoop cluster?&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 Dec 2017 20:24:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224159#M186023</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2017-12-13T20:24:41Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between Hadoop block size and input splits in Hadoop, and why are there two parameters?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224160#M186024</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/11907/shivkumar82015.html" nodeid="11907"&gt;@zkfs&lt;/A&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;U&gt;Block Size:&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;&lt;/U&gt;Physical Location&lt;/STRONG&gt; where the data been stored i.e default
size of the HDFS block is 128 MB which we can configure as per our requirement. &lt;/P&gt;&lt;P&gt;All blocks of the file are of the &lt;STRONG&gt;same size except the last block&lt;/STRONG&gt;, which can be
of same size or smaller. &lt;/P&gt;&lt;P&gt;The files are split into &lt;STRONG&gt;128 MB blocks&lt;/STRONG&gt; and then stored
into &lt;STRONG&gt;Hadoop FileSystem.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;in HDFS each file will be divided into blocks based on
configuration of the size of block and Hadoop application will distributes
those blocks across the cluster.&lt;/P&gt;&lt;P&gt;The main aim of &lt;STRONG&gt;splitting&lt;/STRONG&gt; the file and storing them across
the cluster is to get &lt;STRONG&gt;more parallelism &lt;/STRONG&gt;and replication factor is helpful to get
&lt;STRONG&gt;fault tolerance&lt;/STRONG&gt;, but it also helps in running your &lt;STRONG&gt;map tasks close&lt;/STRONG&gt; to the data
to avoid putting extra load on the network.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Input Split:- &lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;&lt;/U&gt;Logical representation of Block&lt;/STRONG&gt; or more/lesser
than a Block size&lt;/P&gt;&lt;P&gt;It is &lt;STRONG&gt;used&lt;/STRONG&gt; during &lt;STRONG&gt;data processing&lt;/STRONG&gt; in &lt;STRONG&gt;MapReduce&lt;/STRONG&gt; program or
other processing techniques. InputSplit doesn’t contain actual data, but a
reference to the data.&lt;/P&gt;&lt;P&gt;During MapReduce execution, Hadoop scans through the blocks
and create InputSplits and each inputSplit will be assigned to individual
mappers for processing. &lt;STRONG&gt;Split act as a broker between block and mapper&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Let's take If we are have &lt;STRONG&gt;1.2GB file&lt;/STRONG&gt; divided into 10 blocks
i.e each block is almost 128 MB.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt; InputFormat.getSplits() &lt;/STRONG&gt;is
responsible for generating the input splits which are going to be used each
split as input for each mapper. By default this class is going to create
one input split for each HDFS block.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;B&gt;if input split is not specified &lt;/B&gt;and start and end
positions of records are in the same block,then&lt;STRONG&gt; HDFS block size
will be split size&lt;/STRONG&gt; then &lt;STRONG&gt;10 mappers are initialized&lt;/STRONG&gt; to load the file, &lt;STRONG&gt;each
mapper loads one block&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;If the start and end positions of the records are not in the
same block&lt;/B&gt;, this is the exact problem that &lt;STRONG&gt;input
splits&lt;/STRONG&gt; solve, Input split is going to provide the Start and end positions(offsets) of the records to make sure split having complete record as
key/value pairs to the mappers, then mapper is going to load the block of data according to start and end offset values.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;If we specify split size is false &lt;/B&gt;then whole file will form
&lt;B&gt;one input split&lt;/B&gt; and processed by one map which it takes more time to process
when file is big.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;If your resource is limited and you want to limit the number
of maps &lt;/B&gt;then you can mention Split size as 256 MB then then logical grouping of 256 MB
is formed and only 5 maps will be executed with a size of 256 MB.&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Fri, 15 Dec 2017 12:35:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224160#M186024</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2017-12-15T12:35:50Z</dc:date>
    </item>
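The mapper-count arithmetic in the reply above (a 1.2 GB file, 128 MB blocks, 10 mappers by default, 5 mappers with 256 MB splits) can be sketched in plain Python. This is an illustrative model of the split-size rule used by Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)), not Hadoop code; the helper names are made up for the example:

```python
import math

def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Mirrors FileInputFormat's rule: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, split_size):
    # One split per split_size chunk of the file; the last split may be smaller.
    return math.ceil(file_size / split_size)

MB = 1024 * 1024
file_size = int(1.2 * 1024 * MB)  # the 1.2 GB file from the example

# Default: split size == block size (128 MB) -> 10 mappers, one per block.
default_split = compute_split_size(block_size=128 * MB)
print(num_splits(file_size, default_split))  # -> 10

# Raising the minimum split size to 256 MB -> 5 mappers of 256 MB each.
big_split = compute_split_size(block_size=128 * MB, min_size=256 * MB)
print(num_splits(file_size, big_split))      # -> 5
```

Note that a 256 MB split is obtained by raising the *minimum* split size, since the formula otherwise caps splits at the block size.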
    <item>
      <title>Re: Difference between Hadoop block size and input splits in Hadoop, and why are there two parameters?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224161#M186025</link>
      <description>&lt;P&gt;Your comments are appreciated, thank you.&lt;/P&gt;&lt;P&gt;As you mentioned, and in addition, we can change the input split size as required using the parameters below.&lt;/P&gt;&lt;PRE&gt;mapred.max.split.size :- Caps the input split size for a job; lower it to get more, smaller splits (to grow splits beyond the block size, raise mapred.min.split.size instead).
dfs.block.size        :- The global HDFS block size parameter, used while storing data in the cluster.
&lt;/PRE&gt;</description>
      <pubDate>Sun, 17 Dec 2017 16:04:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Difference-between-hadoop-block-Size-and-Input-Splits-in/m-p/224161#M186025</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2017-12-17T16:04:10Z</dc:date>
    </item>
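As a sanity check on the roles of the two job-side parameters named in the last reply: under the standard split-size formula max(minSize, min(maxSize, blockSize)), mapred.max.split.size can only shrink splits below the block size, while mapred.min.split.size grows them past it. A quick illustration in plain Python (not Hadoop code; sizes in MB):

```python
def split_size(block, min_size=1, max_size=float("inf")):
    # The split-size rule used by Hadoop's FileInputFormat:
    # max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block))

block = 128  # block size in MB

print(split_size(block))                # -> 128 (default: one split per block)
print(split_size(block, max_size=64))   # -> 64  (max split size shrinks splits)
print(split_size(block, min_size=256))  # -> 256 (min split size grows splits)
```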
  </channel>
</rss>

