Support Questions

Find answers, ask questions, and share your expertise

hive count(*) not working

avatar
Super Collaborator

i have a table with 1 million rows and i can select from it but when i try count(*) from hive it shows 0 rows ?

hive> select txn_id,msg_id from pa_lane_txn limit 10;
OK
19651715642     PRKG
19651715644     TTOL
19651715646     ETKT
19651715648     TTOL
19651715650     TTOL
19651715654     XTKT
19651715656     XTKT
19651715658     TTOL
19651715660     TTOL
19651715662     TTOL
Time taken: 0.142 seconds, Fetched: 10 row(s)
hive> select count(*) from pa_lane_txn;
OK
0
Time taken: 0.083 seconds, Fetched: 1 row(s)
hive>


7 REPLIES 7

avatar

@Sami Ahmad Do you have any Ranger policies in place?

avatar
@Sami Ahmad

Run as below:

select count(*)from pa_lane_txn limit 2;

In above process it might not triggering any mapreduce jobs and just getting from metadata of table status which might not be updated. If this helps, please 'accept' it.

avatar
Super Collaborator

yes that gave me the count but that's not a solution ,its a work around . I would like to know why the count(*) is not working and how can I fix it ?

also if you notice the time it took for the count(*) limit 2 is very long .

hive> select count(*) from pa_lane_txn;
OK
0
Time taken: 1.372 seconds, Fetched: 1 row(s)
hive> select count(*) from pa_lane_txn limit 2;
Query ID = hdfs_20170628090950_5997e0e1-3524-4af7-9d3d-094490af3d56
Total jobs = 1
Launching Job 1 out of 1

Status: Running (Executing on YARN cluster with App id application_1498486352413_0031)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 154.75 s
--------------------------------------------------------------------------------
OK
9999999
Time taken: 157.841 seconds, Fetched: 1 row(s)
hive>


avatar
@Sami Ahmad

Hope @Ishan has provided the correct reason for route cause. Let me know if you have any questions.

avatar
Rising Star

@Sami Ahmad For some tables hive just looks at the table metadata and fetches the values which might not have been updated. There are two ways to approach this.

1. You can run ANALYZE TABLE pa_lane_txn and then run select count(*) statement. This will give you the correct value.

2. You can force hive to run a mapreduce job to count the number of rows by setting fetch task conversion to none;

hive> set hive.fetch.task.conversion=none;

avatar
Explorer

@Sami Ahmad

It works after I set this in hive-site.xml

I am not sure the reason, but it was forced to run mapreduce

hive.compute.query.using.stats=false

I also found something in hive metastore (mine is MySQL).

mysql> SELECT PARAM_VALUE FROM PARTITION_PARAMS AS A, PARTITIONS AS B WHERE A.PARAM_KEY='numRows' and A.PART_ID=B.PART_ID and A.PARAM_VALUE = 0 limit 10;

It seems if PARAM_VALUE is not -1, hive will return this value without running mapreduce.

If PARAM_VALUE equals to -1, mapreduce will be executed.

avatar
New Contributor

Its better not to disturb the properties on the statistics usage like hive.compute.query.using.stats. It impacts the way the statistics are used in your query for performance optimization and execution plans. It has tremendous influence on execution plans, the statistics stored depends on the file format as well. Therefore definitely not a solution to change any property with regards to statistics.

The real reason for count not working correctly is the statistics not updated in the hive due to which it returns 0. When a table is created first, the statistics is written with no data rows. Thereafter any data append/change happens hive requires to update this statistics in the metadata. Depending on the circumstances hive might not be updating this real time.

Therefore running the ANALYZE command recomputes this statistics to make this work correctly.