Created 03-05-2018 04:28 AM
My Hive table is in ORC format and queries in it run fastest when columns in where clause are sorted. But in my case there are not currently. What is the syntax to sort a column just before query?
Created 03-05-2018 04:52 AM
If I understand your question properly, you have an unsorted ORC table. And you want to query that table but want to "sort" the data "before" querying! This does not make any sense since you would be firing some "query" to have sorted data to fire another query on top of it.
Sort can be a costly operation depending on how you implement it. However, there are a bunch of other options that you can use while querying the data which can speed up your queries. Follows some details.
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
Partition and column statistics from fetched from the metastsore. Use this with caution. If you have too many partitions and/or columns, this could degrade performance.
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
Also, you may want to look at the best practices to create ORC tables, mentioned here , so that you can have the maximum of your queries in the minimum of time!
Hope that helps!
Created 03-05-2018 04:52 AM
If I understand your question properly, you have an unsorted ORC table. And you want to query that table but want to "sort" the data "before" querying! This does not make any sense since you would be firing some "query" to have sorted data to fire another query on top of it.
Sort can be a costly operation depending on how you implement it. However, there are a bunch of other options that you can use while querying the data which can speed up your queries. Follows some details.
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
Partition and column statistics from fetched from the metastsore. Use this with caution. If you have too many partitions and/or columns, this could degrade performance.
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
Also, you may want to look at the best practices to create ORC tables, mentioned here , so that you can have the maximum of your queries in the minimum of time!
Hope that helps!
Created 03-05-2018 06:09 AM
Will usage of DISTRIBUTE BY or SORT BY be helpful?