Support Questions

Find answers, ask questions, and share your expertise

Can we sort a column of a Hive table just before query?

avatar
Contributor

My Hive table is in ORC format and queries in it run fastest when columns in where clause are sorted. But in my case there are not currently. What is the syntax to sort a column just before query?

1 ACCEPTED SOLUTION

avatar

If I understand your question properly, you have an unsorted ORC table. And you want to query that table but want to "sort" the data "before" querying! This does not make any sense since you would be firing some "query" to have sorted data to fire another query on top of it.

Sort can be a costly operation depending on how you implement it. However, there are a bunch of other options that you can use while querying the data which can speed up your queries. Follows some details.

  1. Use Tez execution engine. It is way faster than traditional MR jobs launched by Hive.
  2. Enable predicate pushdown (PPD) to filter at the storage layer:

    SET hive.optimize.ppd=true;

    SET hive.optimize.ppd.storage=true

  3. Vectorized query execution processes data in batches of 1024 rows instead of one by one:

    SET hive.vectorized.execution.enabled=true;

    SET hive.vectorized.execution.reduce.enabled=true;

  4. Enable the Cost Based Optimizer (COB) for efficient query execution based on cost and fetch table statistics:

    SET hive.cbo.enable=true;

    SET hive.compute.query.using.stats=true;

    SET hive.stats.fetch.column.stats=true;

    SET hive.stats.fetch.partition.stats=true;

    Partition and column statistics from fetched from the metastsore. Use this with caution. If you have too many partitions and/or columns, this could degrade performance.

  5. Control reducer output:

    SET hive.tez.auto.reducer.parallelism=true;

    SET hive.tez.max.partition.factor=20;

    SET hive.exec.reducers.bytes.per.reducer=128000000;

Also, you may want to look at the best practices to create ORC tables, mentioned here , so that you can have the maximum of your queries in the minimum of time!

Hope that helps!

View solution in original post

2 REPLIES 2

avatar

If I understand your question properly, you have an unsorted ORC table. And you want to query that table but want to "sort" the data "before" querying! This does not make any sense since you would be firing some "query" to have sorted data to fire another query on top of it.

Sort can be a costly operation depending on how you implement it. However, there are a bunch of other options that you can use while querying the data which can speed up your queries. Follows some details.

  1. Use Tez execution engine. It is way faster than traditional MR jobs launched by Hive.
  2. Enable predicate pushdown (PPD) to filter at the storage layer:

    SET hive.optimize.ppd=true;

    SET hive.optimize.ppd.storage=true

  3. Vectorized query execution processes data in batches of 1024 rows instead of one by one:

    SET hive.vectorized.execution.enabled=true;

    SET hive.vectorized.execution.reduce.enabled=true;

  4. Enable the Cost Based Optimizer (COB) for efficient query execution based on cost and fetch table statistics:

    SET hive.cbo.enable=true;

    SET hive.compute.query.using.stats=true;

    SET hive.stats.fetch.column.stats=true;

    SET hive.stats.fetch.partition.stats=true;

    Partition and column statistics from fetched from the metastsore. Use this with caution. If you have too many partitions and/or columns, this could degrade performance.

  5. Control reducer output:

    SET hive.tez.auto.reducer.parallelism=true;

    SET hive.tez.max.partition.factor=20;

    SET hive.exec.reducers.bytes.per.reducer=128000000;

Also, you may want to look at the best practices to create ORC tables, mentioned here , so that you can have the maximum of your queries in the minimum of time!

Hope that helps!

avatar
Contributor

Will usage of DISTRIBUTE BY or SORT BY be helpful?