<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140165#M27844</link>
    <description>&lt;P&gt;Thanks, everyone, for the replies; I appreciate it. Is it possible to use a single mapper to read the compressed file and then apply a codec mechanism to distribute the data across nodes? Please let me know.&lt;/P&gt;</description>
    <pubDate>Tue, 10 May 2016 19:19:51 GMT</pubDate>
    <dc:creator>issaq_mohd</dc:creator>
    <dc:date>2016-05-10T19:19:51Z</dc:date>
    <item>
      <title>How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140162#M27841</link>
      <description />
      <pubDate>Tue, 10 May 2016 18:13:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140162#M27841</guid>
      <dc:creator>issaq_mohd</dc:creator>
      <dc:date>2016-05-10T18:13:03Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140163#M27842</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10357/issaq-mohd.html" nodeid="10357" target="_blank"&gt;@Issaq Mohammad&lt;/A&gt; If the replication factor is not 1, the data will be distributed across different nodes. See the following details:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="4131-name-node.png" style="width: 900px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/21790iCBE939FCFF98EDEC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="4131-name-node.png" alt="4131-name-node.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 08:23:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140163#M27842</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T08:23:11Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140164#M27843</link>
      <description>&lt;P&gt;In addition to what Neeraj said: the data will be cut into blocks and distributed, but, perhaps more relevantly, you will have a SINGLE mapper reading that file (and piecing it back together).&lt;/P&gt;&lt;P&gt;This is true for GZ, for example, which is a so-called "non-splittable" compression format: a map task cannot read a single block but essentially needs to read the full file from the start.&lt;/P&gt;&lt;P&gt;So the rule of thumb is: if you have GZ-compressed files (which is perfectly fine and often done), make sure they are not big, and be aware that each of them will be read by a single map task. Depending on compression ratio and performance SLAs, you want to stay below 128 MB.&lt;/P&gt;&lt;P&gt;There are other "splittable" compression algorithms supported (mainly LZO) in case you cannot guarantee that. And some native formats, such as HBase HFiles and Hive ORC files, support compression inherently, mostly by compressing internal blocks or fields.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 18:51:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140164#M27843</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-10T18:51:39Z</dc:date>
    </item>
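    <!-- Editorial sketch, not part of the thread: the rule of thumb in the post above (keep each gzip file under one block so each gets its own map task) can be illustrated with a plain-Python splitter. Filenames and chunk sizes here are illustrative; this splits on byte boundaries, whereas a real job would split on record boundaries.

    ```python
    import gzip

    def split_gzip(src_path, dest_prefix, max_bytes):
        """Split one large .gz file into smaller .gz parts.

        Each part holds at most max_bytes of uncompressed data, so a
        MapReduce job would run one map task per part instead of a
        single map task over the whole non-splittable file.
        """
        parts = []
        with gzip.open(src_path, "rb") as src:
            idx = 0
            while True:
                chunk = src.read(max_bytes)
                if not chunk:
                    break
                part_path = f"{dest_prefix}.part{idx:04d}.gz"
                with gzip.open(part_path, "wb") as out:
                    out.write(chunk)
                parts.append(part_path)
                idx += 1
        return parts
    ```

    For HDFS you would pick max_bytes at or below the block size (e.g. 128 MB), per the advice above. -->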
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140165#M27844</link>
      <description>&lt;P&gt;Thanks, everyone, for the replies; I appreciate it. Is it possible to use a single mapper to read the compressed file and then apply a codec mechanism to distribute the data across nodes? Please let me know.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 19:19:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140165#M27844</guid>
      <dc:creator>issaq_mohd</dc:creator>
      <dc:date>2016-05-10T19:19:51Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140167#M27846</link>
      <description>&lt;P&gt;I'm not exactly sure what you mean by "codec mechanism", but if you are asking whether you could transform a single big GZ file into smaller gz files, or into uncompressed files, you would most likely use Pig:&lt;/P&gt;&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig"&gt;http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig&lt;/A&gt;&lt;/P&gt;&lt;P&gt;To specify the number of writers, you will need to force reducers:&lt;/P&gt;&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-instead-of-thousands-of-ti"&gt;http://stackoverflow.com/questions/19789642/how-do-i-force-pigstorage-to-output-a-few-large-files-instead-of-thousands-of-ti&lt;/A&gt;&lt;/P&gt;&lt;P&gt;And here are some tips on setting the number of reducers:&lt;/P&gt;&lt;P&gt;&lt;A href="http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features"&gt;http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Instead of Pig you could also write a small MapReduce job; there you are more flexible, at the price of a bit of coding. Spark might work too, or Hive using the DISTRIBUTE BY keyword.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 19:32:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140167#M27846</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-10T19:32:58Z</dc:date>
    </item>
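    <!-- Editorial sketch, not part of the thread: the "single reader, forced reducers" pattern described above can be mimicked in plain Python. One task reads the non-splittable gzip; N writers (standing in for Pig's PARALLEL reducers or Hive's DISTRIBUTE BY) emit smaller compressed files that later jobs can process in parallel. Names and the round-robin key are illustrative.

    ```python
    import gzip

    def redistribute(src_path, dest_prefix, num_writers):
        """Single reader fans records out to num_writers compressed outputs."""
        paths = [f"{dest_prefix}.r{i:02d}.gz" for i in range(num_writers)]
        outs = [gzip.open(p, "wt") for p in paths]
        try:
            with gzip.open(src_path, "rt") as src:
                for n, line in enumerate(src):
                    # round-robin partitioning; a real job would hash a key
                    outs[n % num_writers].write(line)
        finally:
            for out in outs:
                out.close()
        return paths
    ```
    -->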
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140168#M27847</link>
      <description>&lt;P&gt;You can achieve this by reading the "non-splittable" compressed format with a single mapper and then distributing the data to multiple nodes using a reducer.&lt;/P&gt;&lt;P&gt;HDFS will store the data on multiple nodes even if the files are compressed (with a non-splittable or splittable codec): HDFS splits the compressed file based on the block size. When reading the file back in an MR job, the job will have a single mapper if the file is compressed with a non-splittable codec; otherwise (splittable codec), it will have multiple mappers reading the data.&lt;/P&gt;&lt;P&gt;How the data is distributed:&lt;/P&gt;&lt;P&gt;Suppose you have a 1024 MB compressed file and your Hadoop cluster has a 128 MB block size.&lt;/P&gt;&lt;P&gt;When you upload the compressed file to HDFS, it will be split into 8 blocks (128 MB each) and distributed to different nodes of the cluster. HDFS takes care of which node receives each block, depending on cluster health, node health, and HDFS balance.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 19:46:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140168#M27847</guid>
      <dc:creator>pradeep_bhadani</dc:creator>
      <dc:date>2016-05-10T19:46:01Z</dc:date>
    </item>
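    <!-- Editorial sketch, not part of the thread: the block arithmetic in the post above (1024 MB file / 128 MB block size = 8 blocks) is just a ceiling division, computed here with integers to avoid float rounding.

    ```python
    def hdfs_block_count(file_bytes, block_bytes=128 * 1024 * 1024):
        """Blocks a file occupies in HDFS: ceil(file size / block size)."""
        return (file_bytes + block_bytes - 1) // block_bytes
    ```

    A 1024 MB file on a 128 MB block size yields 8 blocks; one extra byte would yield 9, since a partial final block still occupies a block. -->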
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140169#M27848</link>
      <description>&lt;P&gt;Hello &lt;A rel="user" href="https://community.cloudera.com/users/10357/issaq-mohd.html" nodeid="10357"&gt;@Issaq Mohammad&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Here are some useful posts on file formats:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="http://www.semantikoz.com/blog/getting-started-with-hadoop-and-big-data-with-text-and-hive/"&gt;Getting started with Text and Apache Hive&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="http://www.semantikoz.com/blog/optimising-hadoop-big-data-text-hive/"&gt;Optimising Hadoop with Text and Hive&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="http://www.semantikoz.com/blog/faster-big-data-hadoop-hive-rcfile/"&gt;Faster Big Data on Hadoop with Hive and RCFile&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I hope that helps you to navigate the space a bit better.&lt;/P&gt;</description>
      <pubDate>Tue, 10 May 2016 20:48:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140169#M27848</guid>
      <dc:creator>christian_proko</dc:creator>
      <dc:date>2016-05-10T20:48:06Z</dc:date>
    </item>
    <item>
      <title>Re: How is a huge compressed file stored in HDFS? Is the data distributed across different nodes, or does it get stored on a single node?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140170#M27849</link>
      <description>&lt;P&gt;Here is a great writeup on file compression in Hadoop - &lt;A href="http://comphadoop.weebly.com/" target="_blank"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 00:11:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-a-huge-compressed-file-will-get-stored-in-HDFS-system-Is/m-p/140170#M27849</guid>
      <dc:creator>SQLShaw</dc:creator>
      <dc:date>2016-05-11T00:11:43Z</dc:date>
    </item>
  </channel>
</rss>

