Created 05-17-2017 02:48 AM
I am attempting to perform an ANALYZE on a partitioned table to generate statistics for numRows and totalSize. For a non-partitioned table I get the results I am looking for but for a dynamic partitioned table it does not provide the information I am seeking. As a newbie to Hive, I assume I am doing something wrong.
We are running Hive 1.2.1.2.5.
Test Case 1:
CREATE TABLE `test_table` (`date` DATE,`name` VARCHAR(255)); INSERT INTO `test_table` VALUES ('2016-12-30','john'),('2016-12-30','sarah'), ('2016-12-31','bill'), ('2017-01-01','mary'),('2017-01-01','pete'); ANALYZE TABLE `test_table` COMPUTE STATISTICS; SHOW tblproperties `test_table`;
The results are as I would expect:
prpt_name | prpt_value |
COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true"} |
numFiles | 1 |
numRows | 5 |
rawDataSize | 76 |
totalSize | 81 |
transient_lastDdlTime | 1494980403 |
Test Case 2:
CREATE TABLE `test_table` (`name` VARCHAR(255)) PARTITIONED BY (`date` DATE); INSERT INTO `test_table` PARTITION (`date`='2016-12-30') VALUES ('john'),('sarah'); INSERT INTO `test_table` PARTITION (`date`='2016-12-31') VALUES ('bill'); INSERT INTO `test_table` PARTITION (`date`='2017-01-01') VALUES ('mary'),('pete'); ANALYZE TABLE `test_table` PARTITION(`date`) COMPUTE STATISTICS; SHOW tblproperties `test_table`;
However, I do not see any of the statistics I would expect to see:
prpt_name | prpt_value |
transient_lastDdlTime | 1494978354 |
Is there something I am doing wrong with the table creation or with the analyze statement?
EDIT: Per suggestion, I did test using the following analyze statement as well with the same results:
ANALYZE TABLE `test_table` PARTITION(`date`='2016-12-30',`date`='2016-12-31',`date`='2017-01-01') COMPUTE STATISTICS;
Created 05-18-2017 06:07 AM
The stats for partitioned table are available per partition, you can do desc formatted, example:
hive> desc formatted `test_table` partition(`date`='2016-12-30'); ... Partition Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles 1 numRows 2 rawDataSize 9 totalSize 11 transient_lastDdlTime 1495086721
Created 05-17-2017 04:24 AM
Can you try ANALYZE TABLE test_table PARTITION('2016-12-30') COMPUTE STATISTICS; and check the stats again ?
Created 05-17-2017 04:55 AM
Invalid syntax:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:35 cannot recognize input near ''2016-12-30'' ')' 'COMPUTE' in analyze statement<br>
According to the Hive documentation the partition column and name need to be specified if you are analyzing a particular partition. Specifying the column name only can be used to analyze all partitions.https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables–ANALYZE
Created 05-18-2017 06:07 AM
The stats for partitioned table are available per partition, you can do desc formatted, example:
hive> desc formatted `test_table` partition(`date`='2016-12-30'); ... Partition Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles 1 numRows 2 rawDataSize 9 totalSize 11 transient_lastDdlTime 1495086721
Created 05-18-2017 04:51 PM
Ah, I see. Thank you @Deepesh. I was able to confirm viewing the statistics of interest when specifying a specific partition. I have a follow up question based on that.
Is there a single command to view the stats summed for all partitions or would I need to execute the above for each partition in the table and total the values myself?
Created 05-18-2017 07:57 PM
Overall information is available across partitions only if you are on HDP-2.6 and using Hive2 which i see you are currently not from your description (1.2.1.2.5). If you are using hive 2 the information would be something like:
beeline> desc formatted `test_table`; | Table Parameters: | | | | | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} | | | numFiles | 3 | | | numRows | 5 | | | rawDataSize | 21 | | | totalSize | 26 | | | transient_lastDdlTime | 1495086467 |
In case of Hive 1 unfortunately you will have to run the command per partition and aggregate manually.
Created 09-23-2019 09:12 AM
Using the smae example od data, I've got something like that in my result, trying to have details of one or more than one partition with (desc formatted operation)
The problem here si that each time I've got only information about one and only one partition, even if I'm specifying two or three.
Do you know please how HIVE behave in this case ? Is it a bug ?
Thank you