- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Cannot generate stats for partitioned Hive table
- Labels:
-
Apache Hive
Created ‎05-17-2017 02:48 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am attempting to perform an ANALYZE on a partitioned table to generate statistics for numRows and totalSize. For a non-partitioned table I get the results I am looking for but for a dynamic partitioned table it does not provide the information I am seeking. As a newbie to Hive, I assume I am doing something wrong.
We are running Hive 1.2.1.2.5.
Test Case 1:
CREATE TABLE `test_table` (`date` DATE,`name` VARCHAR(255)); INSERT INTO `test_table` VALUES ('2016-12-30','john'),('2016-12-30','sarah'), ('2016-12-31','bill'), ('2017-01-01','mary'),('2017-01-01','pete'); ANALYZE TABLE `test_table` COMPUTE STATISTICS; SHOW tblproperties `test_table`;
The results are as I would expect:
prpt_name | prpt_value |
COLUMN_STATS_ACCURATE | {"BASIC_STATS":"true"} |
numFiles | 1 |
numRows | 5 |
rawDataSize | 76 |
totalSize | 81 |
transient_lastDdlTime | 1494980403 |
Test Case 2:
CREATE TABLE `test_table` (`name` VARCHAR(255)) PARTITIONED BY (`date` DATE); INSERT INTO `test_table` PARTITION (`date`='2016-12-30') VALUES ('john'),('sarah'); INSERT INTO `test_table` PARTITION (`date`='2016-12-31') VALUES ('bill'); INSERT INTO `test_table` PARTITION (`date`='2017-01-01') VALUES ('mary'),('pete'); ANALYZE TABLE `test_table` PARTITION(`date`) COMPUTE STATISTICS; SHOW tblproperties `test_table`;
However, I do not see any of the statistics I would expect to see:
prpt_name | prpt_value |
transient_lastDdlTime | 1494978354 |
Is there something I am doing wrong with the table creation or with the analyze statement?
EDIT: Per suggestion, I did test using the following analyze statement as well with the same results:
ANALYZE TABLE `test_table` PARTITION(`date`='2016-12-30',`date`='2016-12-31',`date`='2017-01-01') COMPUTE STATISTICS;
Created ‎05-18-2017 06:07 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The stats for partitioned table are available per partition, you can do desc formatted, example:
hive> desc formatted `test_table` partition(`date`='2016-12-30'); ... Partition Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles 1 numRows 2 rawDataSize 9 totalSize 11 transient_lastDdlTime 1495086721
Created ‎05-17-2017 04:24 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Can you try ANALYZE TABLE test_table PARTITION('2016-12-30') COMPUTE STATISTICS; and check the stats again ?
Created ‎05-17-2017 04:55 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Invalid syntax:
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:35 cannot recognize input near ''2016-12-30'' ')' 'COMPUTE' in analyze statement<br>
According to the Hive documentation the partition column and name need to be specified if you are analyzing a particular partition. Specifying the column name only can be used to analyze all partitions.https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables–ANALYZE
Created ‎05-18-2017 06:07 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The stats for partitioned table are available per partition, you can do desc formatted, example:
hive> desc formatted `test_table` partition(`date`='2016-12-30'); ... Partition Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles 1 numRows 2 rawDataSize 9 totalSize 11 transient_lastDdlTime 1495086721
Created ‎05-18-2017 04:51 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ah, I see. Thank you @Deepesh. I was able to confirm viewing the statistics of interest when specifying a specific partition. I have a follow up question based on that.
Is there a single command to view the stats summed for all partitions or would I need to execute the above for each partition in the table and total the values myself?
Created ‎05-18-2017 07:57 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Overall information is available across partitions only if you are on HDP-2.6 and using Hive2 which i see you are currently not from your description (1.2.1.2.5). If you are using hive 2 the information would be something like:
beeline> desc formatted `test_table`; | Table Parameters: | | | | | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} | | | numFiles | 3 | | | numRows | 5 | | | rawDataSize | 21 | | | totalSize | 26 | | | transient_lastDdlTime | 1495086467 |
In case of Hive 1 unfortunately you will have to run the command per partition and aggregate manually.
Created ‎09-23-2019 09:12 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Using the smae example od data, I've got something like that in my result, trying to have details of one or more than one partition with (desc formatted operation)
The problem here si that each time I've got only information about one and only one partition, even if I'm specifying two or three.
Do you know please how HIVE behave in this case ? Is it a bug ?
Thank you
