Created 03-23-2017 06:47 AM
We have external tables on AWS S3 buckets, stored as uncompressed CSV. When we run a simple query such as SELECT * FROM example_table LIMIT 10, or one with WHERE serial = 'SomeID', it takes a minimum of 30 seconds and consumes all of the resources available in the ResourceManager before the output is finally displayed, even though the table only holds roughly 500 to 1,000 records. Meanwhile, some very large tables with 3 million records return results even faster.
Also, one table with just 8,331 records and 19 columns takes 5-6 minutes to complete a COUNT query. Initialization alone takes 2-3 minutes, and once the job is initiated it completes quickly; this happens only with this table! I changed the execution engine to MR for this table, and the job initiated quickly and completed in 80 seconds.
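For reference, switching engines for the session looked like this (example_table is a placeholder for the actual table name):

SET hive.execution.engine=mr;   -- fall back to MapReduce for this session
SELECT COUNT(*) FROM example_table;
SET hive.execution.engine=tez;  -- switch back to Tez afterwards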
I do not fully understand the Tez execution plan, so any help would be appreciated!
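For what it's worth, I have been generating the plan with a plain EXPLAIN (the query is just an example):

EXPLAIN SELECT * FROM example_table WHERE serial = 'SomeID';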
We have a 3-node cluster built on EC2 with HDP 2.5.2 and Hive 1.2.100 installed. 2 of the nodes are DataNodes, and ResourceManager resources total 24 vcores and 108 GB RAM.
Created 03-26-2017 04:20 PM
Thanks for the reply, folks. I have found the issue! When we import data from legacy DB servers using Spark, Hive staging files are created during the Spark job in the target location where the data resides. When we export that data to S3 using distcp, these Hive staging files move to the bucket as well. So when we query the table with Hive, it appears to check all of those staging files before returning the output. The number of splits also matters: there were many small files, so I merged them to get fewer mappers and better performance, which has now been achieved. I get the count of the 3-million-record table in a fraction of a second!
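For anyone who hits the same problem, this is roughly what I did; the paths, bucket, and table names below are placeholders:

# remove the leftover Hive staging directories before copying to S3
hadoop fs -rm -r -skipTrash '/data/example_table/.hive-staging*'
hadoop distcp /data/example_table s3a://my-bucket/example_table

-- then compact the many small files into fewer, larger ones so Hive launches fewer mappers
SET hive.merge.tezfiles=true;                 -- merge small files at the end of the Tez DAG
SET hive.merge.smallfiles.avgsize=134217728;  -- kick off a merge when avg file size < 128 MB
SET hive.merge.size.per.task=268435456;       -- aim for ~256 MB merged files
INSERT OVERWRITE TABLE example_table SELECT * FROM example_table;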
Created 03-23-2017 03:49 PM
If I understand correctly, you say that your large tables (3 million records) return a query like this relatively quickly:
SELECT * FROM example_table LIMIT 10, or the same with WHERE serial = 'SomeID'
but when you run a similar query against an external table stored on AWS S3, it performs badly.
Did you try copying the table's data file to HDFS and then creating an external table on the HDFS file? I bet that could make a big difference in performance.
I assume the difference is because, when the table data is stored on S3, Hive first needs to copy the data from S3 to a node where Hive runs, and the speed of that operation depends on the available network bandwidth.
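For example, something along these lines (the bucket name, paths, and schema are placeholders):

hadoop distcp s3a://my-bucket/example_table hdfs:///data/example_table

CREATE EXTERNAL TABLE example_table_hdfs (
  serial  STRING,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///data/example_table';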
Created 03-26-2017 04:12 PM
Thanks a lot for the reply! We have an on-prem environment where the tables are external on HDFS, and they run extremely fast. We move this table data to S3 and later query it from the other environment I described above. Once a team confirms the data, we move it from S3 to AWS Redshift. Anyway, I have found the cause of the timeout errors, which I'll post below.
Created 03-23-2017 06:32 PM
Also, here are some tips on how to improve performance: http://hortonworks.github.io/hdp-aws/s3-hive/index.html#improving-performance-for-hive-jobs
Created 03-26-2017 04:12 PM
Thanks for the reply! Bookmarked the link!