- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
how do I make that external hive table perform faster
- Labels:
-
Apache Hive
Created on 02-11-2020 11:35 AM - last edited on 02-11-2020 05:22 PM by ask_bill_brooks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
question... I get csv file I convert that into parquet in hdfs., copy that file to Hive table location....Hive table can see the file and can query the data successfully....
All is good... but question is how do I make that hive table perform faster ... empid is the is Pkey on the source.
Table is External Table...if I were to create Partition do I create on Empid (primary key) ? Do I have to create new table ... what are my options?
Created on 02-12-2020 05:29 AM - edited 02-12-2020 05:31 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@AndyTech Can you share some details about the hive table (schema, number rows, data size, etc)? Can you describe your hive setup (configuration, # of nodes, tez/yarn container size, queue setup, etc)? Can you speak to the current speed benchmark, and your expected speed?
Depending on the data itself, partition can have some performance. There are a bunch of things you can do to hive itself, that will have a huge impact on performance. With working table, and working query, especially parquet, I would want to investigate Hive Performance Tuning before making any changes to the data structure. Additionally, there would be some discussion about parquet vs orc which is known to be faster.
Share some details and myself or others will comment further.
Created 02-12-2020 05:43 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @stevenmatison
I am using Parquet format, I tried with ORC not a significant difference, then I changed following setting as follows: Not knowing a lot on the following settings but based on my research. I am not using partitions yet.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.vectorized.execution = ture
set hive.vectorized.execution.enabled = true
also I changed following execution engine
set hive.execution.engine = spark
I think changing engine to spark made a lot of difference....
Now query is running from 2.48 min to 15 sec
I am quite satisfied with current performance but I would sure appreciate other advise for me and for the community.
Thanks and appreciate you response.
Andy