<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question HIVE Best Practice in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145920#M28211</link>
    <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;Can you help me understand Hive best practices on Hortonworks HDP 2.3, so that we can support it better?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;HIVE Best Practices&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;1. What is the maximum number of joins we can use in Hive for best performance? What are the limitations of using joins? What happens if we use multiple joins (will it affect performance, or will the job fail)?&lt;/LI&gt;
&lt;LI&gt;2. When querying, what kind of fields should be used as join keys?&lt;/LI&gt;
&lt;LI&gt;3. How should we make use of partitioning and bucketing?&lt;/LI&gt;
&lt;LI&gt;4. How critical is type casting, i.e. converting data types on the fly in queries?&lt;/LI&gt;
&lt;LI&gt;5. Will using multiple casts affect Hive job performance?&lt;/LI&gt;
&lt;LI&gt;6. How can we avoid using multiple inner joins? Is there any alternative to multiple joins?&lt;/LI&gt;
&lt;LI&gt;7. What is the best way of doing splitting?&lt;/LI&gt;
&lt;LI&gt;8. When should we use a left outer join or a right outer join to avoid a full table scan?&lt;/LI&gt;
&lt;LI&gt;9. What is the best way to write a SELECT query so that it does not scan the full table?&lt;/LI&gt;
&lt;LI&gt;10. Map join optimization: when should we use map joins?&lt;/LI&gt;
&lt;LI&gt;11. Skew join optimization: when should we use skew joins?&lt;/LI&gt;
&lt;LI&gt;12. SMB join optimization: when should we use SMB joins?&lt;/LI&gt;
&lt;LI&gt;13. When processing huge amounts of data, what needs to be done to prevent job failures? What are the best practices in that scenario?&lt;/LI&gt;
&lt;LI&gt;14. What is the advantage of denormalization, and where should I use it in Hive?&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Fri, 13 May 2016 16:06:40 GMT</pubDate>
    <dc:creator>suresh_b_k</dc:creator>
    <dc:date>2016-05-13T16:06:40Z</dc:date>
    <item>
      <title>HIVE Best Practice</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145920#M28211</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;Can you help me understand Hive best practices on Hortonworks HDP 2.3, so that we can support it better?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;HIVE Best Practices&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;1. What is the maximum number of joins we can use in Hive for best performance? What are the limitations of using joins? What happens if we use multiple joins (will it affect performance, or will the job fail)?&lt;/LI&gt;
&lt;LI&gt;2. When querying, what kind of fields should be used as join keys?&lt;/LI&gt;
&lt;LI&gt;3. How should we make use of partitioning and bucketing?&lt;/LI&gt;
&lt;LI&gt;4. How critical is type casting, i.e. converting data types on the fly in queries?&lt;/LI&gt;
&lt;LI&gt;5. Will using multiple casts affect Hive job performance?&lt;/LI&gt;
&lt;LI&gt;6. How can we avoid using multiple inner joins? Is there any alternative to multiple joins?&lt;/LI&gt;
&lt;LI&gt;7. What is the best way of doing splitting?&lt;/LI&gt;
&lt;LI&gt;8. When should we use a left outer join or a right outer join to avoid a full table scan?&lt;/LI&gt;
&lt;LI&gt;9. What is the best way to write a SELECT query so that it does not scan the full table?&lt;/LI&gt;
&lt;LI&gt;10. Map join optimization: when should we use map joins?&lt;/LI&gt;
&lt;LI&gt;11. Skew join optimization: when should we use skew joins?&lt;/LI&gt;
&lt;LI&gt;12. SMB join optimization: when should we use SMB joins?&lt;/LI&gt;
&lt;LI&gt;13. When processing huge amounts of data, what needs to be done to prevent job failures? What are the best practices in that scenario?&lt;/LI&gt;
&lt;LI&gt;14. What is the advantage of denormalization, and where should I use it in Hive?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Fri, 13 May 2016 16:06:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145920#M28211</guid>
      <dc:creator>suresh_b_k</dc:creator>
      <dc:date>2016-05-13T16:06:40Z</dc:date>
    </item>
    <item>
      <title>Re: HIVE Best Practice</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145921#M28212</link>
      <description>&lt;P&gt;Those are a lot of (broad) questions!&lt;/P&gt;&lt;P&gt;I would recommend that you first look at the "Hive performance tuning" documentation on our website: &lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_performance_tuning/content/ch_hive_architectural_overview.html"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_performance_tuning/content/ch_hive_architectural_overview.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You could probably also find some answers on this forum: &lt;A href="https://community.hortonworks.com/topics/Hive.html"&gt;https://community.hortonworks.com/topics/Hive.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;However, given the number of questions you have, I would recommend contacting Hortonworks professional services to have a consultant help you with your specific implementation (there is no "universal holy grail" of tuning; in the end, configurations and queries are optimised for specific use cases).&lt;/P&gt;</description>
      <pubDate>Fri, 13 May 2016 16:42:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145921#M28212</guid>
      <dc:creator>sluangsay</dc:creator>
      <dc:date>2016-05-13T16:42:26Z</dc:date>
    </item>
    <item>
      <title>Re: HIVE Best Practice</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145922#M28213</link>
      <description>&lt;P&gt;Thanks for the information. At least can you tell me the advantage of skew joins and where to use them? And instead of using multiple joins, what is the best way to run the query?&lt;/P&gt;</description>
      <pubDate>Fri, 13 May 2016 16:46:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145922#M28213</guid>
      <dc:creator>suresh_b_k</dc:creator>
      <dc:date>2016-05-13T16:46:39Z</dc:date>
    </item>
    <item>
      <title>Re: HIVE Best Practice</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145923#M28214</link>
      <description>&lt;UL&gt;&lt;LI&gt;1. What is the maximum number of joins we can use in Hive for best performance? What are the limitations of using joins? What happens if we use multiple joins (will it affect performance, or will the job fail)?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;There is no maximum number of joins. By now Hive has a good cost-based optimizer (CBO) that uses statistics, so as long as you properly compute statistics on the tables you can run complex queries as well. However, denormalized tables are cheap (storage is cheap), so they make more sense than in traditional databases. But as Sourygna said, this is a very general question.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;2. When querying, what kind of fields should be used as join keys?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;As in any database, integer keys are the best. Strings work but may require more memory. If you use floats you get what you deserve :-).&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;3. How should we make use of partitioning and bucketing?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;A href="http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data" target="_blank"&gt;http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data&lt;/A&gt;&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;4. How critical is type casting, i.e. converting data types on the fly in queries?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Better if you don't do it. ORC files are optimized for each data type, so storing values as strings and casting them on demand will slow performance. For delimited files this is much less important.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;5. Will using multiple casts affect Hive job performance?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;See 4. Yes, as long as you use ORC.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;6. How can we avoid using multiple inner joins? Is there any alternative to multiple joins?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Denormalization?&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;7. What is the best way of doing splitting?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Not sure I understand the question. If you use ORC you have by default 256MB blocks which contain 64MB stripes. That is a good default, but if you want more map tasks you can reduce the block size.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;8. When should we use a left outer join or a right outer join to avoid a full table scan?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Very generic question.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;9. What is the best way to write a SELECT query so that it does not scan the full table?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Very generic question. Look at the presentation I linked for details on predicate pushdown. Sort your data properly during insert.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;10. Map join optimization: when should we use map joins?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;When the small table fits easily into the memory of a map task.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;11. Skew join optimization: when should we use skew joins?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization&lt;/A&gt; has details on when it is useful.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;12. SMB join optimization: when should we use SMB joins?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Seriously, you should read the Hive Confluence pages. In general I would trust the CBO.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;13. When processing huge amounts of data, what needs to be done to prevent job failures? What are the best practices in that scenario?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The problems I have seen were WAY too many partitions and small files in each partition. Too many splits result in problems, so you should make sure to load data into Hive properly (see my presentation). Make sure the file sizes in your Hive tables are reasonable. Also keep an eye on the reducer and mapper counts to make sure they are in a healthy range. If they aren't, there is no fixed rule as to why.&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;14. What is the advantage of denormalization, and where should I use it in Hive?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Fewer joins but more storage space.&lt;/P&gt;&lt;P&gt;As Sourygna said, these are some veeery generic questions. You might have to drill down a bit into what you concretely want.&lt;/P&gt;</description>
      <pubDate>Fri, 13 May 2016 17:54:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HIVE-Best-Practice/m-p/145923#M28214</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-13T17:54:11Z</dc:date>
    </item>
  </channel>
</rss>

