Support Questions

Find answers, ask questions, and share your expertise

Compaction stuck Ready for Cleaning

avatar
Contributor

Hi all,

in my test cluster I am noticing a slowdown in the execution of acid queries.

Analyzing in detail I noticed that the compactions remain stuck at "ready for cleaning" and there are many delta files.

 

delta files.png

I also tried to manually launch the compaction without any result. hive.metastore.housekeeping.threads.on and hive.metastore.housekeeping.threads.on is true only in 1 hive metastore host.

This is a table properties:

bucketing_version 2
transactional true
transactional_properties default
transient_lastDdlTime 1720453037

In the development cluster with the identical configuration I do not have this problem. Do you have any suggestions?

I'm running in CDP 7.1.9

Thanks, Lorenzo

1 ACCEPTED SOLUTION

avatar
Expert Contributor
@Lorenzo The issue seems to be related to HIVE-27191 where some mhl_txnids do not exist in TXNS,completed_txn_components txn_components table but they are still present in min_history_level table, as a result, the cleaner gets blocked and many entries are stuck in the ready-for-cleaning state. To confirm that collect the output of below query
 
SELECT MHL_TXNID FROM HIVE.MIN_HISTORY_LEVEL WHERE MHL_MIN_OPEN_TXNID = (SELECT MIN(MHL_MIN_OPEN_TXNID) FROM HIVE.MIN_HISTORY_LEVEL);
 
Once we get the output of the above query check if those txn ids are there in TXNS,completed_txn_components txn_components tables using below commands.
 
select * from txn_components where tc_txnid IN (MHL_TXNID );
select * from completed_txn_components where ctc_txnid IN (MHL_TXNID);
select * from TXNS where ctc_txnid IN (MHL_TXNID);
 
If we got 0 results from the above queries this confirms that the MHL_TXNIDs we got above are orphans and we need to remove them in order to unblock the cleaner.
 
delete from MIN_HISTORY_LEVEL where MHL_TXNID=13422; --(repeat for all)
 
Hope this helps you in resolving the issue

View solution in original post

6 REPLIES 6

avatar
Super Collaborator

Please verify if there are any long-running transactions on the cluster and, if found, consider aborting them using the "abort transactions" command, if it is safe to do so.

You can use the "show transactions" command in Beeline to validate the long-running transactions.

Another alternative is to use the following backend DB query .

SELECT * FROM "TXNS" WHERE "TXN_ID" = ( SELECT min(res.id) FROM ( SELECT "NTXN_NEXT" AS id FROM "NEXT_TXN_ID" UNION ALL SELECT "MHL_TXNID" FROM "MIN_HISTORY_LEVEL" WHERE "MHL_TXNID" = ( SELECT min("MHL_MIN_OPEN_TXNID") FROM "MIN_HISTORY_LEVEL" ) ) res)


Note: This query is for postgres DB,  modify it depending upon the backend DB in which you're using. 

avatar
Contributor

Hi,

actually launching the show transactions command I see open transactions from June 13th. I tried to do abort transactions "id" but I receive the following error:
Error while compiling statement: FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.ddl.DDLTask. org.apache.thrift.TApplicationException: Internal error processing abort_txns
 
INFO : Completed compiling command(queryId=hive_20240822152836_9e17b9ba-4e7b-44d1-a710-8e086edd8da0); Time taken: 0.001 seconds
INFO : Executing command(queryId=hive_20240822152836_9e17b9ba-4e7b-44d1-a710-8e086edd8da0): abort transactions 13422
INFO : Starting task [Stage-0:DDL] in serial mode
ERROR : Failed
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Internal error processing abort_txns
at org.apache.hadoop.hive.ql.metadata.Hive.abortTransactions(Hive.java:5549) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.ddl.process.abort.AbortTransactionsOperation.execute(AbortTransactionsOperation.java:35) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.ddl.DDLTask.execute(DDLTask.java:82) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:357) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:785) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:524) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:518) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:234) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:91) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:334) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at javax.security.auth.Subject.doAs(Subject.java:423) ~[?:?]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ~[hadoop-common-3.1.1.7.1.9.4-4.jar:?]
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:354) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.thrift.TApplicationException: Internal error processing abort_txns
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_abort_txns(ThriftHiveMetastore.java:5929) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.abort_txns(ThriftHiveMetastore.java:5916) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.abortTxns(HiveMetaStoreClient.java:3445) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at jdk.internal.reflect.GeneratedMethodAccessor529.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at com.sun.proxy.$Proxy55.abortTxns(Unknown Source) ~[?:?]
at jdk.internal.reflect.GeneratedMethodAccessor529.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:3759) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at com.sun.proxy.$Proxy55.abortTxns(Unknown Source) ~[?:?]
at org.apache.hadoop.hive.ql.metadata.Hive.abortTransactions(Hive.java:5546) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
... 26 more
ERROR : DDLTask failed, DDL Operation: class org.apache.hadoop.hive.ql.ddl.process.abort.AbortTransactionsOperation
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Internal error processing abort_txns
at org.apache.hadoop.hive.ql.metadata.Hive.abortTransactions(Hive.java:5549) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.ddl.process.abort.AbortTransactionsOperation.execute(AbortTransactionsOperation.java:35) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.ddl.DDLTask.execute(DDLTask.java:82) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:357) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:785) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:524) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:518) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:234) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:91) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:334) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at javax.security.auth.Subject.doAs(Subject.java:423) ~[?:?]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ~[hadoop-common-3.1.1.7.1.9.4-4.jar:?]
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:354) ~[hive-service-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.thrift.TApplicationException: Internal error processing abort_txns
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_abort_txns(ThriftHiveMetastore.java:5929) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.abort_txns(ThriftHiveMetastore.java:5916) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.abortTxns(HiveMetaStoreClient.java:3445) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at jdk.internal.reflect.GeneratedMethodAccessor529.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at com.sun.proxy.$Proxy55.abortTxns(Unknown Source) ~[?:?]
at jdk.internal.reflect.GeneratedMethodAccessor529.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:3759) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
at com.sun.proxy.$Proxy55.abortTxns(Unknown Source) ~[?:?]
at org.apache.hadoop.hive.ql.metadata.Hive.abortTransactions(Hive.java:5546) ~[hive-exec-3.1.3000.7.1.9.4-4.jar:3.1.3000.7.1.9.4-4]
... 26 more
ERROR : FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.ddl.DDLTask. org.apache.thrift.TApplicationException: Internal error processing abort_txns
INFO : Completed executing command(queryId=hive_20240822152836_9e17b9ba-4e7b-44d1-a710-8e086edd8da0); Time taken: 1.018 seconds
INFO : OK
 
Any suggestion?
Thanks,
Lorenzo

avatar
Super Collaborator

To determine the cause of the failure, it is recommended to review the HMS logs within the specified time frame as the exception stack-trace does not provide sufficient information.

avatar
Contributor

I deleted the open transactions from the Oracle db. After restarting hive unfortunately I still have the same problems. There are no error messages from the logs and the tables are not locked.

 

INFO org.apache.hadoop.hive.ql.txn.compactor.Cleaner: [Cleaner-executor-thread-0]: Starting cleaning for id:5365402,dbname:XXXX,tableName:XXXX,partName:schema_sorgente=XXXX,state:,type:MAJOR,enqueueTime:0,start:0,properties:null,runAs:hive,tooManyAborts:false,hasOldAbort:false,highestWriteId:826,errorMessage:null,workerId: null,initiatorId: null
2024-08-27 14:26:53,877 WARN org.apache.hadoop.hive.ql.txn.compactor.Cleaner: [Cleaner-executor-thread-0]: id=5365402 Remained 21 obsolete directories from hdfs://XXXX. [base_0000201_v1772045,base_0000014_v1403023,delta_0000002_0000002_0000,delete_delta_0000003_0000003_0000,delta_0000003_0000003_0000,delta_0000004_0000004_0000,delete_delta_0000007_0000007_0000,delta_0000007_0000007_0000,delta_0000008_0000008_0000,delete_delta_0000011_0000011_0000,delta_0000011_0000011_0000,delta_0000012_0000012_0000,delete_delta_0000013_0000013_0000,delta_0000013_0000013_0000,delta_0000014_0000014_0000,delete_delta_0000200_0000200_0000,delta_0000200_0000200_0000,delta_0000201_0000201_0000,delete_delta_0000498_0000498_0000,delta_0000498_0000498_0000,delta_0000499_0000499_0000]
2024-08-27 14:26:53,877 WARN org.apache.hadoop.hive.ql.txn.compactor.Cleaner: [Cleaner-executor-thread-0]: No files were removed. Leaving queue entry id:5365402,dbname:XXXX,tableName:XXXX,partName:schema_sorgente=XXXX,state:,type:MAJOR,enqueueTime:0,start:0,properties:null,runAs:hive,tooManyAborts:false,hasOldAbort:false,highestWriteId:826,errorMessage:null,workerId: null,initiatorId: null in ready for cleaning state.

 

 

avatar
Super Collaborator

Based on the INFO logs, it appears that there is an open transaction blocking the compaction cleaner process. This requires a separate investigation, so I advise raising a support case to resolve the problem. Additionally, we need to examine the HMS logs, backend DB dump, and the output of "hdfs dfs -ls -R" command.

avatar
Expert Contributor
@Lorenzo The issue seems to be related to HIVE-27191 where some mhl_txnids do not exist in TXNS,completed_txn_components txn_components table but they are still present in min_history_level table, as a result, the cleaner gets blocked and many entries are stuck in the ready-for-cleaning state. To confirm that collect the output of below query
 
SELECT MHL_TXNID FROM HIVE.MIN_HISTORY_LEVEL WHERE MHL_MIN_OPEN_TXNID = (SELECT MIN(MHL_MIN_OPEN_TXNID) FROM HIVE.MIN_HISTORY_LEVEL);
 
Once we get the output of the above query check if those txn ids are there in TXNS,completed_txn_components txn_components tables using below commands.
 
select * from txn_components where tc_txnid IN (MHL_TXNID );
select * from completed_txn_components where ctc_txnid IN (MHL_TXNID);
select * from TXNS where ctc_txnid IN (MHL_TXNID);
 
If we got 0 results from the above queries this confirms that the MHL_TXNIDs we got above are orphans and we need to remove them in order to unblock the cleaner.
 
delete from MIN_HISTORY_LEVEL where MHL_TXNID=13422; --(repeat for all)
 
Hope this helps you in resolving the issue