Support Questions

Find answers, ask questions, and share your expertise

Cannot generate stats for partitioned Hive table

avatar
New Contributor

I am attempting to perform an ANALYZE on a partitioned table to generate statistics for numRows and totalSize. For a non-partitioned table I get the results I am looking for but for a dynamic partitioned table it does not provide the information I am seeking. As a newbie to Hive, I assume I am doing something wrong.

We are running Hive 1.2.1.2.5.

Test Case 1:

CREATE TABLE `test_table` (`date` DATE,`name` VARCHAR(255));

INSERT INTO `test_table` VALUES
('2016-12-30','john'),('2016-12-30','sarah'),
('2016-12-31','bill'),
('2017-01-01','mary'),('2017-01-01','pete');

ANALYZE TABLE `test_table` COMPUTE STATISTICS;

SHOW tblproperties `test_table`;

The results are as I would expect:

prpt_nameprpt_value
COLUMN_STATS_ACCURATE{"BASIC_STATS":"true"}
numFiles1
numRows5
rawDataSize76
totalSize81
transient_lastDdlTime1494980403

Test Case 2:

CREATE TABLE `test_table` (`name` VARCHAR(255)) PARTITIONED BY (`date` DATE);

INSERT INTO `test_table` PARTITION (`date`='2016-12-30')
VALUES ('john'),('sarah');

INSERT INTO `test_table` PARTITION (`date`='2016-12-31')
VALUES ('bill');

INSERT INTO `test_table` PARTITION (`date`='2017-01-01')
VALUES ('mary'),('pete');

ANALYZE TABLE `test_table` PARTITION(`date`) COMPUTE STATISTICS;

SHOW tblproperties `test_table`;

However, I do not see any of the statistics I would expect to see:

prpt_nameprpt_value
transient_lastDdlTime1494978354

Is there something I am doing wrong with the table creation or with the analyze statement?

EDIT: Per suggestion, I did test using the following analyze statement as well with the same results:

ANALYZE TABLE `test_table`
PARTITION(`date`='2016-12-30',`date`='2016-12-31',`date`='2017-01-01')
COMPUTE STATISTICS;
1 ACCEPTED SOLUTION

avatar

The stats for partitioned table are available per partition, you can do desc formatted, example:

hive> desc formatted `test_table` partition(`date`='2016-12-30');
...
Partition Parameters:
	COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\"}
	numFiles            	1
	numRows             	2
	rawDataSize         	9
	totalSize           	11
	transient_lastDdlTime	1495086721

View solution in original post

6 REPLIES 6

avatar
Guru

@John Glorioso

Can you try ANALYZE TABLE test_table PARTITION('2016-12-30') COMPUTE STATISTICS; and check the stats again ?

avatar
New Contributor

Invalid syntax:

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:35 cannot recognize input near ''2016-12-30'' ')' 'COMPUTE' in analyze statement<br>

According to the Hive documentation the partition column and name need to be specified if you are analyzing a particular partition. Specifying the column name only can be used to analyze all partitions.https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables–ANALYZE

avatar

The stats for partitioned table are available per partition, you can do desc formatted, example:

hive> desc formatted `test_table` partition(`date`='2016-12-30');
...
Partition Parameters:
	COLUMN_STATS_ACCURATE	{\"BASIC_STATS\":\"true\"}
	numFiles            	1
	numRows             	2
	rawDataSize         	9
	totalSize           	11
	transient_lastDdlTime	1495086721

avatar
New Contributor

Ah, I see. Thank you @Deepesh. I was able to confirm viewing the statistics of interest when specifying a specific partition. I have a follow up question based on that.

Is there a single command to view the stats summed for all partitions or would I need to execute the above for each partition in the table and total the values myself?

avatar

Overall information is available across partitions only if you are on HDP-2.6 and using Hive2 which i see you are currently not from your description (1.2.1.2.5). If you are using hive 2 the information would be something like:

beeline> desc formatted `test_table`;
| Table Parameters: |                       |                            |
|                   | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} |
|                   | numFiles              | 3                          |
|                   | numRows               | 5                          |
|                   | rawDataSize           | 21                         |
|                   | totalSize             | 26                         |
|                   | transient_lastDdlTime | 1495086467                 |

In case of Hive 1 unfortunately you will have to run the command per partition and aggregate manually.

avatar
New Contributor

Using the smae example od data, I've got something like that in my result, trying to have details of one or more than one partition with (desc formatted operation)

 

bad_analyze.PNG

 

The problem here si that each time I've got only information about one and only one partition, even if I'm specifying two or three. 
Do you know please how HIVE behave in this case ? Is it a bug ? 

 

Thank you