Created 06-27-2017 09:25 PM
i have a table with 1 million rows and i can select from it but when i try count(*) from hive it shows 0 rows ?
hive> select txn_id,msg_id from pa_lane_txn limit 10; OK 19651715642 PRKG 19651715644 TTOL 19651715646 ETKT 19651715648 TTOL 19651715650 TTOL 19651715654 XTKT 19651715656 XTKT 19651715658 TTOL 19651715660 TTOL 19651715662 TTOL Time taken: 0.142 seconds, Fetched: 10 row(s) hive> select count(*) from pa_lane_txn; OK 0 Time taken: 0.083 seconds, Fetched: 1 row(s) hive>
Created 06-27-2017 09:42 PM
@Sami Ahmad Do you have any Ranger policies in place?
Created 06-27-2017 10:00 PM
Run as below:
select count(*)from pa_lane_txn limit 2;
In above process it might not triggering any mapreduce jobs and just getting from metadata of table status which might not be updated. If this helps, please 'accept' it.
Created 06-28-2017 01:19 PM
yes that gave me the count but that's not a solution ,its a work around . I would like to know why the count(*) is not working and how can I fix it ?
also if you notice the time it took for the count(*) limit 2 is very long .
hive> select count(*) from pa_lane_txn; OK 0 Time taken: 1.372 seconds, Fetched: 1 row(s) hive> select count(*) from pa_lane_txn limit 2; Query ID = hdfs_20170628090950_5997e0e1-3524-4af7-9d3d-094490af3d56 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1498486352413_0031) -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 2 2 0 0 0 0 Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0 -------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 154.75 s -------------------------------------------------------------------------------- OK 9999999 Time taken: 157.841 seconds, Fetched: 1 row(s) hive>
Created 06-29-2017 04:08 AM
Hope @Ishan has provided the correct reason for route cause. Let me know if you have any questions.
Created 06-28-2017 02:33 PM
@Sami Ahmad For some tables hive just looks at the table metadata and fetches the values which might not have been updated. There are two ways to approach this.
1. You can run ANALYZE TABLE pa_lane_txn and then run select count(*) statement. This will give you the correct value.
2. You can force hive to run a mapreduce job to count the number of rows by setting fetch task conversion to none;
hive> set hive.fetch.task.conversion=none;
Created 11-08-2017 02:02 PM
It works after I set this in hive-site.xml
I am not sure the reason, but it was forced to run mapreduce
hive.compute.query.using.stats=false
I also found something in hive metastore (mine is MySQL).
mysql> SELECT PARAM_VALUE FROM PARTITION_PARAMS AS A, PARTITIONS AS B WHERE A.PARAM_KEY='numRows' and A.PART_ID=B.PART_ID and A.PARAM_VALUE = 0 limit 10;
It seems if PARAM_VALUE is not -1, hive will return this value without running mapreduce.
If PARAM_VALUE equals to -1, mapreduce will be executed.
Created 10-23-2018 04:44 PM
Its better not to disturb the properties on the statistics usage like hive.compute.query.using.stats. It impacts the way the statistics are used in your query for performance optimization and execution plans. It has tremendous influence on execution plans, the statistics stored depends on the file format as well. Therefore definitely not a solution to change any property with regards to statistics.
The real reason for count not working correctly is the statistics not updated in the hive due to which it returns 0. When a table is created first, the statistics is written with no data rows. Thereafter any data append/change happens hive requires to update this statistics in the metadata. Depending on the circumstances hive might not be updating this real time.
Therefore running the ANALYZE command recomputes this statistics to make this work correctly.