Question: I receive a CSV file, convert it to Parquet in HDFS, and copy the file to the Hive table location. The Hive table can see the file and I can query the data successfully.
All is good, but my question is: how do I make this Hive table perform faster? empid is the primary key on the source.
The table is an external table. If I were to create a partition, should I partition on empid (the primary key)? Do I have to create a new table? What are my options?
@AndyTech Can you share some details about the Hive table (schema, number of rows, data size, etc.)? Can you describe your Hive setup (configuration, number of nodes, Tez/YARN container sizes, queue setup, etc.)? Can you share your current speed benchmark and your expected speed?
Depending on the data itself, partitioning can yield some performance gain. There are also a number of things you can tune in Hive itself that will have a much bigger impact. With a working table and a working query, especially on Parquet, I would investigate Hive performance tuning before making any changes to the data structure. There is also the question of Parquet vs. ORC; ORC is generally the faster format for Hive workloads.
Share some details and I or others will comment further.
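To illustrate the partitioning question: partitioning works best on a low-cardinality column such as a load date, not on a unique key like empid (one partition per employee would produce an enormous number of tiny partitions). If your queries filter on empid, bucketing is usually the better fit. This is only a sketch; the table name, column names, and the load_date column are assumptions, not from the thread:

```sql
-- Hypothetical external table, partitioned by an assumed
-- low-cardinality column (load_date), NOT by the unique empid.
CREATE EXTERNAL TABLE employees_part (
  empid  BIGINT,
  name   STRING,
  salary DOUBLE
)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET
LOCATION '/data/employees_part';

-- If queries filter on empid, bucketing (CLUSTERED BY) on empid
-- is usually more appropriate than partitioning on it:
CREATE TABLE employees_bucketed (
  empid  BIGINT,
  name   STRING,
  salary DOUBLE
)
CLUSTERED BY (empid) INTO 32 BUCKETS
STORED AS ORC;
```

Either way this means creating a new table with the desired layout and inserting the existing data into it; you cannot retrofit partitions onto data already written as flat files.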
I am using Parquet format; I tried ORC but did not see a significant difference. I then changed the following settings based on my research (I don't know a lot about them yet). I am not using partitions yet.
set hive.vectorized.execution.enabled = true;
I also changed the execution engine:
set hive.execution.engine = spark;
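If these session settings prove helpful, they can be made permanent in hive-site.xml instead of being set per session. This is a sketch; the exact file location varies by distribution:

```xml
<!-- hive-site.xml: persist the session settings above -->
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>hive.vectorized.execution.enabled</name>
  <value>true</value>
</property>
```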
I think changing the engine to Spark made the biggest difference.
The query now runs in 15 seconds, down from 2.48 minutes.
I am quite satisfied with the current performance, but I would appreciate any other advice, for me and for the community.
Thanks, and I appreciate your response.