Support Questions

Find answers, ask questions, and share your expertise

Run Multiple Count Operation On Data Table

avatar
Rising Star

Have a list of about 100+ SQL  Count Queries to run against a Hive Data Table,

Looking for the most efficient way to run these queries.

Queries are accessed at runtime as a list of queries stored in another Hive Table, generated by a different process.  

 

Queries like these each with a different where clause, where clauses are complex:

1. Select count(1) as count1 from MyTable where (... complex where clause here...)

2. Select Count(1) as Count1 from MyTable where (... where clause here ...)

3. etc..

 

Environment:

Cloudera CDH 6.2

 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

With your original approach, each query can filter out whole partitions of the table based on the WHERE clauses (that is if your table is partitioned and at least some of the columns in the clause use those partitions). However, if your WHERE clauses are pretty different/unique, then you will be scanning big portion of the table for every one of your 100+ queries.

 

With the suggested approach, there is only one scan of the table, but there is more processing that is happening for each row. The best way to see if performance is better is just to test it and go with the winner.

View solution in original post

6 REPLIES 6

avatar
Master Collaborator

Do your WHERE conditions rely on different columns in MyTable or all the same columns, just different filter criteria? If it's the latter than the answer is partitioning your Hive table based on those key columns.

 

Also if your MyTable is not too big, it would be most efficient to do your 100 queries in memory with something like SparkSQL, rather than Hive. 

avatar
Rising Star

Where clause relies on multiple columns

avatar

Impala would probably give you the fastest response time.

Personally, I would write a script (Python on whatever) that fetched the queries and just ran them one by one.

You could try to combine together the queries in various ways if you really cared about reducing latency (I'm not sure that any of these alternatives would make a massive difference, but maybe some amount).

 

E.g. the following would require only a single scan of the table (although it might be more expensive cause you don't have filtering from the where clause).

Select count(case when <where clause 1> then 1 end), count(case when <where clause 2> then 1 end)
from MyTable

 

 

 

 

avatar
Rising Star

Please explain this

" ... E.g. the following would require only a single scan of the table (although it might be more expensive cause you don't have filtering from the where clause)...."

 

not sure what you mean by the comment in brackets

how will it be more expensive, what can we do to fix that.

 

avatar
Master Collaborator

With your original approach, each query can filter out whole partitions of the table based on the WHERE clauses (that is if your table is partitioned and at least some of the columns in the clause use those partitions). However, if your WHERE clauses are pretty different/unique, then you will be scanning big portion of the table for every one of your 100+ queries.

 

With the suggested approach, there is only one scan of the table, but there is more processing that is happening for each row. The best way to see if performance is better is just to test it and go with the winner.

avatar
Rising Star

could you give a working example of this in spark 2.4 using scala dataframe

can't seem to find the correct syntax...

 

val result = dataFrame.select(count(when( col("col_1") === "val_1" &&  col("col_2") === "val_2", 1)