Created 03-14-2023 06:01 AM
Dear all,
I want to run COMPUTE INCREMENTAL STATS with PARTITION option on some partitions of a partitioned table.
It is unclear to me from the documentation if DROP INCREMENTAL STATS must be run on those partitions.
For my understanding of how INCREMENTAL STATS with the PARTITION option work, it should not be required, but I've found below sentence on this page https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_perf_stats.html#perf_stats
"In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition.".
I'd prefer to avoid running DROP INCREMENTAL STATS if not required.
Does anyone know?
Many thanks in advance.
Created 03-24-2023 10:05 AM
Thanks for the offer and sorry for the late reply. We use CDP 7.4.
In the meantime I have been able to also receive an answer from Cloudera support and they have confirmed that in our scenario we do not need to run the DROP INCREMENTAL STATS command. I do not think that additional tests are required.
Created 03-14-2023 09:13 AM
@Me Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our CDP experts @PabitraDas and @aakulov who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created on 03-14-2023 09:14 AM - edited 03-14-2023 09:16 AM
Hi,
This statement in the doc
"In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition."
Applies specifically to a partition in which stats are already available but you added more data to that existing partition.
If you are unsure about whether stats exist for a partition you can run show table stats <table_name>; and check the "Incremental stats" section
Query: show table stats test_part
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
| b | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
| 1 | 0 | 1 | 0B | NOT CACHED | NOT CACHED | TEXT | false | hdfs://xxxx:8020/user/hive/warehouse/test_part/b=1 |
| Total | -1 | 1 | 0B | 0B | | | | |
+-------+-------+--------+------+--------------+-------------------+--------+-------------------+--------------------------------------------------------------------------+
Fetched 2 row(s) in 5.60s
If false, you can run COMPUTE INCREMENTAL STATS with PARTITION
If true and you have added more data to this partition then you have to drop the stats and then run COMPUTE INCREMENTAL STATS with PARTITION
Created 03-14-2023 10:06 AM
Hello,
this is not answering my question. I'd like to know why I must run the DROP INCREMENTAL STATS in my scenario.
I have run a small test:
- created a partitioned table
- run COMPUTE INCREMENTAL STATS
- run SHOW TABLE STATS and saved the result
- added one row to one partition
- run COMPUTE INCREMENTAL STATS PARTITION (<partitioning key of the partition to which the row was added)
- run SHOW TABLE STATS and compare to the previous run
Result of the comparison: the stats of the partition to which I have added one row were updated and reflecting the number of rows in the partition / number of files.
Based on my tests it looks like the stats are updated, hence my question.
Created 03-16-2023 03:11 PM
@Me Sorry for that confusion. I see what you mean now
Per: https://impala.apache.org/docs/build/html/topics/impala_perf_stats.html#perf_stats_incremental
In Impala 2.1.0 and higher, you can use the COMPUTE INCREMENTAL STATS and DROP INCREMENTAL STATS commands. The INCREMENTAL clauses work with incremental statistics, a specialized feature for partitioned tables.
When you compute incremental statistics for a partitioned table, by default Impala only processes those partitions that do not yet have incremental statistics. By processing only newly added partitions, you can keep statistics up to date without incurring the overhead of reprocessing the entire table each time.
So the drop statistics is intended for "COMPUTE INCREMENTAL STATS" and not for " COMPUTE INCREMENTAL STATS with partition"
May I know which version of CDP you are using, so that I can test on my end and confirm you.
Created 03-24-2023 10:05 AM
Thanks for the offer and sorry for the late reply. We use CDP 7.4.
In the meantime I have been able to also receive an answer from Cloudera support and they have confirmed that in our scenario we do not need to run the DROP INCREMENTAL STATS command. I do not think that additional tests are required.
Created 03-20-2023 11:21 AM
@Me Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. If you are still experiencing the issue, can you provide the information venkatsambath has requested? Thanks.
Regards,
Diana Torres,Created 03-24-2023 10:12 AM
@DianaTorres I was able to received the reply to my question through Cloudera Support - see also my comment above. Should I mark my reply above as Solution?
Created 03-24-2023 10:35 AM
@Me Yes, that is the solution to your post, thank you for coming back with the fix. This will help more users in the future. Thanks for your contribution!
Regards,
Diana Torres,