Support Questions

Find answers, ask questions, and share your expertise

What's the fastest way to know if an Hive Table contains data?

avatar

Wanted to know what's the fastest/most efficient way to know if there exists data in a partitioned table. select * from database.table limit 1 according to the query plan does a full table scan. Is there any way to avoid this and quickly know if there exists any data in a given table/partition of a table?

3 REPLIES 3

avatar

1. select * from database.table limit 1 It will never perform a full table scan. You can verify this on Resource Manager with no new job being spawned.


2. You can find the HDFS path corresponding to the table and do a hdfs -du on the path to know what is the size of directory.

avatar

Thanks for the quick response @kgautam

1. Due to a bug we have had to disable the default fetch operation by default and hence a TEZ session is spawned for the operation at the moment for every select * limit 1 operation, also since we use LLAP we have observed there is considerable IO within the mappers from the metrics printed after the query execution.

2. Is this a good approach?

avatar

2 option is not the perfect approach but is full proof, as its never possible for HDFS directory to be empty while Hive has data.