Support Questions

Find answers, ask questions, and share your expertise

COMPUTE INCREMENTAL STATS with PARTITION option: do I need to run DROP INCREMENTAL STATS before it?

avatar
Explorer

Dear all,

 

I want to run COMPUTE INCREMENTAL STATS with PARTITION option on some partitions of a partitioned table.

It is unclear to me from the documentation if DROP INCREMENTAL STATS must be run on those partitions.

 

For my understanding of how INCREMENTAL STATS with the PARTITION option work, it should not be required, but I've found below sentence on this page https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_perf_stats.html#perf_stats

 

"In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition.".

 

I'd prefer to avoid running DROP INCREMENTAL STATS if not required.

 

Does anyone know?

 

Many thanks in advance.

1 ACCEPTED SOLUTION

avatar
Explorer

Thanks for the offer and sorry for the late reply. We use CDP 7.4. 

 

In the meantime I have been able to also receive an answer from Cloudera support and they have confirmed that in our scenario we do not need to run the DROP INCREMENTAL STATS command. I do not think that additional tests are required.

View solution in original post

8 REPLIES 8

avatar
Community Manager

@Me Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our CDP experts @PabitraDas and @aakulov  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar

Hi,

 

This statement in the doc

"In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition."

Applies specifically to a partition in which stats are already available but you added more data to that existing partition.

If you are unsure about whether stats exist for a partition you can run show table stats <table_name>; and check the "Incremental stats" section 

 

Query: show table stats test_part
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
| b | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
| 1 | 0 | 1 | 0B | NOT CACHED | NOT CACHED | TEXT | false | hdfs://xxxx:8020/user/hive/warehouse/test_part/b=1 |
| Total | -1 | 1 | 0B | 0B | | | | |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
Fetched 2 row(s) in 5.60s

 

If false, you can run COMPUTE INCREMENTAL STATS with PARTITION 

If true and you have added more data to this partition then you have to drop the stats and then run COMPUTE INCREMENTAL STATS with PARTITION 

avatar
Explorer

Hello, 

 

this is not answering my question. I'd like to know why I must run the DROP INCREMENTAL STATS in my scenario.

 

I have run a small test:

- created a partitioned table

- run COMPUTE INCREMENTAL STATS

- run SHOW TABLE STATS and saved the result

- added one row to one partition

- run COMPUTE INCREMENTAL STATS PARTITION (<partitioning key of the partition to which the row was added)

- run SHOW TABLE STATS and compare to the previous run

 

Result of the comparison: the stats of the partition to which I have added one row were updated and reflecting the number of rows in the partition / number of files.

 

Based on my tests it looks like the stats are updated, hence my question.

avatar

@Me Sorry for that confusion. I see what you mean now

 

Per: https://impala.apache.org/docs/build/html/topics/impala_perf_stats.html#perf_stats_incremental

COMPUTE INCREMENTAL STATS

In Impala 2.1.0 and higher, you can use the COMPUTE INCREMENTAL STATS and DROP INCREMENTAL STATS commands. The INCREMENTAL clauses work with incremental statistics, a specialized feature for partitioned tables.

When you compute incremental statistics for a partitioned table, by default Impala only processes those partitions that do not yet have incremental statistics. By processing only newly added partitions, you can keep statistics up to date without incurring the overhead of reprocessing the entire table each time.

 

So the drop statistics is intended for "COMPUTE INCREMENTAL STATS" and not for " COMPUTE INCREMENTAL STATS with partition"

 

May I know which version of CDP you are using, so that I can test on my end and confirm you.

avatar
Explorer

Thanks for the offer and sorry for the late reply. We use CDP 7.4. 

 

In the meantime I have been able to also receive an answer from Cloudera support and they have confirmed that in our scenario we do not need to run the DROP INCREMENTAL STATS command. I do not think that additional tests are required.

avatar
Community Manager

@Me Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. If you are still experiencing the issue, can you provide the information venkatsambath has requested? Thanks.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Explorer

@DianaTorres I was able to received the reply to my question through Cloudera Support - see also my comment above. Should I mark my reply above as Solution?  

avatar
Community Manager

@Me Yes, that is the solution to your post, thank you for coming back with the fix. This will help more users in the future. Thanks for your contribution!


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community: