<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive Performance issues with Tez Engine for External Tables in S3 in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171483#M57898</link>
    <description>&lt;P&gt;Thanks for the replies, folks. I have found the issue! When we import data from the legacy DB servers using Spark, Hive staging files are created during the Spark execution in the target location where the data resides. When we export this data to S3 using DistCp, these Hive staging files are copied to the bucket as well. So when we query the tables with Hive, it appears to check all those staging files before returning any output. The number of splits also matters: there were too many of them, so I merged the splits to use fewer mappers, which gave the better performance I was after. I now get the count for the 3-million-record table in a fraction of a second!&lt;/P&gt;</description>
    <pubDate>Sun, 26 Mar 2017 23:20:51 GMT</pubDate>
    <dc:creator>ramcharantej</dc:creator>
    <dc:date>2017-03-26T23:20:51Z</dc:date>
    <item>
      <title>Hive Performance issues with Tez Engine for External Tables in S3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171478#M57893</link>
      <description>&lt;P&gt;We have external tables on AWS S3 buckets, stored as uncompressed CSV. A simple query such as Select * from example_table Limit 10, or one with Where serial = "SomeID", takes at least 30 seconds and consumes all the resources available in the RM before it displays any output, even though these tables hold only about 500 to 1000 records. Meanwhile, very large tables with 3 million records return results faster.&lt;/P&gt;&lt;P&gt;In addition, one table with just 8331 records and 19 columns takes 5-6 minutes to complete a Count query. Initialization alone takes 2-3 minutes; once the job has started, it completes quickly. This happens only with this table. When I switched the execution engine for this table to MR, the query started quickly and completed in 80 seconds.&lt;/P&gt;&lt;P&gt;I do not understand the Tez execution plan. Any help would be appreciated!&lt;/P&gt;&lt;P&gt;We have a 3-node cluster built on EC2 with HDP 2.5.2 and Hive 1.2.100 installed; 2 of the nodes are DataNodes. RM resources are 24 vcores and 108 GB RAM.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 13:47:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171478#M57893</guid>
      <dc:creator>ramcharantej</dc:creator>
      <dc:date>2017-03-23T13:47:40Z</dc:date>
    </item>
    <item>
      <title>Re: Hive Performance issues with Tez Engine for External Tables in S3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171479#M57894</link>
      <description>&lt;P&gt;If I understand correctly, you are saying that large tables (3 million records) return results relatively fast for a query like this:&lt;/P&gt;&lt;PRE&gt;Select * from example_table Limit 10 or Where serial = "SomeID"&lt;/PRE&gt;&lt;P&gt;but when you run a similar query against an external table stored on AWS S3, it performs badly.&lt;/P&gt;&lt;P&gt;Have you tried copying the table data file to HDFS and then creating an external table on the HDFS file? I bet that could make a big difference in performance.&lt;/P&gt;&lt;P&gt;I assume the difference arises because, when the table data is stored on S3, Hive first needs to copy the data from S3 onto a node where Hive runs, and the speed of that operation depends on the available network bandwidth.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Mar 2017 22:49:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171479#M57894</guid>
      <dc:creator>bpgergo</dc:creator>
      <dc:date>2017-03-23T22:49:51Z</dc:date>
    </item>
    <item>
      <title>Re: Hive Performance issues with Tez Engine for External Tables in S3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171480#M57895</link>
      <description>&lt;P&gt;Also, here are some tips on how to improve performance &lt;A href="http://hortonworks.github.io/hdp-aws/s3-hive/index.html#improving-performance-for-hive-jobs" target="_blank"&gt;http://hortonworks.github.io/hdp-aws/s3-hive/index.html#improving-performance-for-hive-jobs&lt;/A&gt; &lt;/P&gt;</description>
      <pubDate>Fri, 24 Mar 2017 01:32:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171480#M57895</guid>
      <dc:creator>Dominika</dc:creator>
      <dc:date>2017-03-24T01:32:55Z</dc:date>
    </item>
    <item>
      <title>Re: Hive Performance issues with Tez Engine for External Tables in S3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171481#M57896</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/12466/pbarna.html" nodeid="12466"&gt;@pbarna&lt;/A&gt;&lt;P&gt;Thanks a lot for the reply! We have an on-prem environment where the tables are external tables on HDFS, and they run extremely fast. We move this table data to S3 and later query it from another environment, the one I described above. Once the data is confirmed by a team, we move it from S3 to AWS Redshift. Anyway, I have found the cause of the timeout errors, which I'll post below.&lt;/P&gt;</description>
      <pubDate>Sun, 26 Mar 2017 23:12:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171481#M57896</guid>
      <dc:creator>ramcharantej</dc:creator>
      <dc:date>2017-03-26T23:12:25Z</dc:date>
    </item>
    <item>
      <title>Re: Hive Performance issues with Tez Engine for External Tables in S3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171482#M57897</link>
      <description>&lt;P&gt;Thanks for the reply! Bookmarked the link!&lt;/P&gt;</description>
      <pubDate>Sun, 26 Mar 2017 23:12:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171482#M57897</guid>
      <dc:creator>ramcharantej</dc:creator>
      <dc:date>2017-03-26T23:12:46Z</dc:date>
    </item>
    <item>
      <title>Re: Hive Performance issues with Tez Engine for External Tables in S3</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171483#M57898</link>
      <description>&lt;P&gt;Thanks for the replies, folks. I have found the issue! When we import data from the legacy DB servers using Spark, Hive staging files are created during the Spark execution in the target location where the data resides. When we export this data to S3 using DistCp, these Hive staging files are copied to the bucket as well. So when we query the tables with Hive, it appears to check all those staging files before returning any output. The number of splits also matters: there were too many of them, so I merged the splits to use fewer mappers, which gave the better performance I was after. I now get the count for the 3-million-record table in a fraction of a second!&lt;/P&gt;</description>
      <pubDate>Sun, 26 Mar 2017 23:20:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Performance-issues-with-Tez-Engine-for-External-Tables/m-p/171483#M57898</guid>
      <dc:creator>ramcharantej</dc:creator>
      <dc:date>2017-03-26T23:20:51Z</dc:date>
    </item>
  </channel>
</rss>

