<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: what is huge datasets for Hive ? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188266#M70636</link>
    <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/45257/sebastienfrackowiak.html" nodeid="45257"&gt;@Sebastien F&lt;/A&gt;, Hive has been documented running on 300+ PB of raw storage at Facebook. The largest cluster is 4,500+ nodes at Yahoo. Yahoo Japan was able to run 100,000 queries per hour, and LLAP ran 100 million rows/s per node.&lt;/P&gt;&lt;P&gt;Hive/Tez scales to hundreds of PB. LLAP is meant for smaller data sets (1-10 TB), which are typical for standard BI-type workloads. That said, LLAP lets you use SSD for cache, so you can extend this to hundreds of TB (if you can afford that much SSD storage).&lt;/P&gt;&lt;P&gt;Hope this helps!&lt;/P&gt;</description>
    <pubDate>Fri, 03 Nov 2017 00:29:28 GMT</pubDate>
    <dc:creator>SQLShaw</dc:creator>
    <dc:date>2017-11-03T00:29:28Z</dc:date>
    <item>
      <title>what is huge datasets for Hive ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188264#M70634</link>
      <description>&lt;P&gt;I have read a lot of articles recommending the fastest solutions for processing datasets.&lt;/P&gt;&lt;P&gt;I saw that Hive / Tez is 100x faster than Hive / MapReduce, but Spark is 100x faster than Hive (Tez or MR not mentioned ;-)), and finally, "it depends on whether you process huge datasets or not".&lt;/P&gt;&lt;P&gt;My first question is: from what size can I consider a dataset "huge"? I presume the number of rows and columns is significant...&lt;/P&gt;&lt;P&gt;My second question is: what if I am querying a few partitions from a large dataset? I think that amounts to querying a small dataset?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Nov 2017 22:42:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188264#M70634</guid>
      <dc:creator>sebastien_frack</dc:creator>
      <dc:date>2017-11-02T22:42:34Z</dc:date>
    </item>
    <item>
      <title>Re: what is huge datasets for Hive ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188265#M70635</link>
      <description>&lt;P&gt;...and I have always wondered how benchmarks are performed: is it just timing an execution on a "clean" platform?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Nov 2017 23:04:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188265#M70635</guid>
      <dc:creator>sebastien_frack</dc:creator>
      <dc:date>2017-11-02T23:04:19Z</dc:date>
    </item>
    <item>
      <title>Re: what is huge datasets for Hive ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188266#M70636</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/45257/sebastienfrackowiak.html" nodeid="45257"&gt;@Sebastien F&lt;/A&gt;, Hive has been documented running on 300+ PB of raw storage at Facebook. The largest cluster is 4,500+ nodes at Yahoo. Yahoo Japan was able to run 100,000 queries per hour, and LLAP ran 100 million rows/s per node.&lt;/P&gt;&lt;P&gt;Hive/Tez scales to hundreds of PB. LLAP is meant for smaller data sets (1-10 TB), which are typical for standard BI-type workloads. That said, LLAP lets you use SSD for cache, so you can extend this to hundreds of TB (if you can afford that much SSD storage).&lt;/P&gt;&lt;P&gt;Hope this helps!&lt;/P&gt;</description>
      <pubDate>Fri, 03 Nov 2017 00:29:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188266#M70636</guid>
      <dc:creator>SQLShaw</dc:creator>
      <dc:date>2017-11-03T00:29:28Z</dc:date>
    </item>
    <item>
      <title>Re: what is huge datasets for Hive ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188267#M70637</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/186/sshaw.html" nodeid="186"&gt;@Scott Shaw&lt;/A&gt;, it helps &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; Thanks a lot.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Nov 2017 15:03:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/what-is-huge-datasets-for-Hive/m-p/188267#M70637</guid>
      <dc:creator>sebastien_frack</dc:creator>
      <dc:date>2017-11-03T15:03:36Z</dc:date>
    </item>
  </channel>
</rss>

