Created on 11-11-2020 08:15 AM - edited 09-16-2022 07:39 AM
I have a list of about 100+ SQL count queries to run against a Hive data table, and I'm looking for the most efficient way to run them.
The queries are read at runtime from a list stored in another Hive table, where they are generated by a different process.
Each query has a different, and complex, WHERE clause, e.g.:
1. Select count(1) as count1 from MyTable where (... complex where clause here ...)
2. Select count(1) as count1 from MyTable where (... where clause here ...)
3. etc.
Environment:
Cloudera CDH 6.2
Created 11-12-2020 10:25 AM
Do your WHERE conditions rely on different columns in MyTable, or on the same columns with different filter criteria? If it's the latter, then the answer is to partition your Hive table on those key columns.
Also, if MyTable is not too big, it would be most efficient to run your 100 queries in memory with something like Spark SQL rather than Hive, along the lines of the sketch below.
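A minimal sketch of that in-memory approach, assuming Spark 2.4 with Hive support; "query_table" and "query_text" are placeholder names for wherever the generated queries live:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveCountQueries")
  .enableHiveSupport()
  .getOrCreate()

// Cache the fact table so the 100+ counts run against memory, not disk.
// A temp view with the same name shadows the Hive table in these queries.
spark.table("MyTable").cache().createOrReplaceTempView("MyTable")

// Fetch the generated queries from the other Hive table.
val queries = spark.table("query_table")
  .select("query_text")
  .collect()
  .map(_.getString(0))

// Run each stored count query against the cached view.
queries.foreach { q =>
  val cnt = spark.sql(q).first().getLong(0)
  println(s"$cnt <- $q")
}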
Created 11-13-2020 10:49 AM
The WHERE clauses rely on multiple columns.
Created 11-12-2020 08:05 PM
Impala would probably give you the fastest response time.
Personally, I would write a script (Python or whatever) that fetched the queries and just ran them one by one.
You could try to combine the queries in various ways if you really cared about reducing latency (I'm not sure any of these alternatives would make a massive difference, but they might help some).
E.g. the following requires only a single scan of the table (although it might be more expensive, because you don't get the filtering from the WHERE clauses):
Select count(case when <where clause 1> then 1 end) as count1,
       count(case when <where clause 2> then 1 end) as count2
from MyTable
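In Spark, that combined statement could be generated from the stored queries themselves. A sketch, assuming you can extract just the predicate text of each query (the whereClauses contents here are purely illustrative):

val whereClauses = Seq(
  "col_1 = 'val_1' and col_2 = 'val_2'",
  "col_3 > 100"
)

val projections = whereClauses.zipWithIndex.map { case (w, i) =>
  s"count(case when $w then 1 end) as count${i + 1}"
}

// One statement, one scan of MyTable, all counts in a single row.
val combined = s"select ${projections.mkString(", ")} from MyTable"
val row = spark.sql(combined).first()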
Created 11-13-2020 10:55 AM
Please explain this:
" ... E.g. the following requires only a single scan of the table (although it might be more expensive, because you don't get the filtering from the WHERE clauses) ..."
I'm not sure what you mean by the comment in brackets.
How will it be more expensive, and what can we do to fix that?
Created 11-13-2020 01:00 PM
With your original approach, each query can skip whole partitions of the table based on its WHERE clause (that is, if your table is partitioned and at least some of the columns in the clause are partition columns). However, if your WHERE clauses are largely different/unique, then you will be scanning a big portion of the table for every one of your 100+ queries.
With the suggested approach there is only one scan of the table, but more processing happens for each row. The best way to see which performs better is simply to test both and go with the winner.
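If it helps, a throwaway timing harness for that test might look like this, reusing the queries list and combined statement from the sketches above:

// Measure wall-clock time of an arbitrary block and print it.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

time("one query per scan") { queries.foreach(q => spark.sql(q).first()) }
time("single combined scan") { spark.sql(combined).first() }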
Created 11-13-2020 09:35 PM
Could you give a working example of this in Spark 2.4 using the Scala DataFrame API?
I can't seem to find the correct syntax:

import org.apache.spark.sql.functions.{col, count, when}

// count(when(...)) counts only rows where the predicate is true;
// non-matching rows yield null, which count() ignores.
val result = dataFrame.select(
  count(when(col("col_1") === "val_1" && col("col_2") === "val_2", 1)).as("count1"))
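The same pattern extends to several counts in one pass, i.e. the DataFrame equivalent of the combined SQL above (the column names and predicates here are placeholders):

val counts = dataFrame.select(
  count(when(col("col_1") === "val_1" && col("col_2") === "val_2", 1)).as("count1"),
  count(when(col("col_3") > 100, 1)).as("count2"))
counts.show()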