Member since: 11-04-2015
Posts: 261
Kudos Received: 44
Solutions: 33
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 9127 | 05-16-2024 03:10 AM |
|  | 4208 | 01-17-2024 01:07 AM |
|  | 3641 | 12-11-2023 02:10 AM |
|  | 7060 | 10-11-2023 08:42 AM |
|  | 4094 | 09-07-2023 01:08 AM |
06-08-2022
01:10 AM
Hi @tallamohan,

The direct usage of the Hive classes (CliSessionState, SessionState, Driver) in the provided code falls under "Hive CLI" / "HCat CLI" access, which is not supported in CDP: https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade/topics/hive-unsupported.html

Please open a case on the MyCloudera Support Portal to get that clarified. The recommended approach is to use Beeline and access the Hive service through HiveServer2.

Best regards, Miklos
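For reference, a minimal Beeline connection sketch; the host name, port, and Kerberos principal below are placeholders to adjust to your HiveServer2 setup:

```shell
# Connect to HiveServer2 via Beeline on a Kerberos-secured cluster
# (hs2-host.example.com and the principal are hypothetical values):
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"

# On a non-secure cluster, a plain connection with a user name:
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n hive
```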
06-08-2022
01:04 AM
1 Kudo
Please remember that one block is not necessarily 256 MB; it can be smaller. Also, not all files have a replication factor of 3; some might have only a single replica, so the numbers can be entirely consistent if those were single-replica files. 600,000 * 256 MB = 153.6 TB is the theoretical maximum; since blocks can be smaller than 256 MB, the 60 TB freed up is reasonable.
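If you want to verify the actual average block size on your cluster, a quick sketch; the fsck summary prints an average block size figure, though the exact wording can vary between Hadoop versions:

```shell
# Run as the HDFS superuser; the summary section of the fsck report
# includes the total block count and the average block size.
hdfs fsck / | grep -i 'avg. block size'
```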
06-03-2022
02:39 AM
2 Kudos
Hi @Amn_468,

The lock contention happens when too many "invalidate metadata" (IM) and "refresh" commands are running. The catalog daemon's responsibility is to load the Hive Metastore metadata (table and partition information, including stats) and the HDFS metadata (the list of files and their block locations). When a table is refreshed (or loaded for the first time after an IM), catalogd has to load this metadata, and it has built-in limits on how many tables, partitions, and files it can load at a time. While doing so it needs to hold a lock on the "catalog update" to avoid simultaneous requests overwriting the previously collected information. So concurrent, long-running "refresh" statements [1] can block each other and delay the publishing of the catalog information.

What can be done:
- reduce the number of IM calls
- reduce the number of refresh calls
- wherever possible, refresh on the partition level only (see the sketch below)
- upgrade: there were improvements in IMPALA-6671, available from CDP 7.1.7 SP1, so an upgrade could also help (though it still cannot fully compensate for high-frequency, heavy refreshes)

I hope this helps the discussions with the users/teams about how frequently and when they submit the refresh queries.

Miklos
Customer Operations Engineer, Cloudera

[1] https://impala.apache.org/docs/build3x/html/topics/impala_refresh.html
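As an illustration of partition-level refreshes, a minimal sketch; the database, table, and partition column names are hypothetical:

```shell
# Refresh only one partition instead of the whole table
# (Impala's REFRESH supports a PARTITION clause):
impala-shell -q "REFRESH sales_db.orders PARTITION (year=2022, month=6)"

# For comparison, a full-table refresh reloads the file metadata
# of every partition and is much heavier on catalogd:
impala-shell -q "REFRESH sales_db.orders"
```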
06-03-2022
12:48 AM
That is great, thank you for sharing the solution! Best regards, Miklos
06-01-2022
03:21 AM
The DN should keep only the files which are still managed and known by the NN. After a huge deletion event these "pending deletes" may of course take some time to be sent to the DNs (and for the DNs to delete them), but usually that does not take long. Check the "pending deletion blocks" chart (tsquery: SELECT pending_deletion_blocks) if this is applicable.

If the above is not applicable, then check it more deeply, as in the sketch below:
- collect a full "hdfs fsck / -files -blocks -locations" output
- pick a DN which you think has more blocks than it should
- verify how many blocks the hdfs fsck report attributes to that DN
- verify on the DN side how many block files it is actually storing
- do those numbers match?
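A minimal sketch of that comparison; the DataNode address and the data directory (/dfs/dn) are placeholders for your actual values:

```shell
# 1. Collect the full block report (run as the HDFS superuser).
hdfs fsck / -files -blocks -locations > /tmp/fsck_report.txt

# 2. Count how many block replicas fsck places on the chosen DataNode
#    (10.0.0.5:9866 is a hypothetical DataNode transfer address).
grep -c '10\.0\.0\.5:9866' /tmp/fsck_report.txt

# 3. On the DataNode host, count the actual block files on disk
#    (adjust /dfs/dn to your configured dfs.datanode.data.dir).
find /dfs/dn -name 'blk_*' ! -name '*.meta' | wc -l
```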
05-31-2022
07:33 AM
Hi Andrea,

Oh, I see, I did not consider that you see this from the DataNodes' perspective. Was this cluster recently upgraded? Is the "Finalize Upgrade" step for HDFS still pending? https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdp/topics/ug_cdh_upgrade_hdfs_finalize.html

While the HDFS upgrade is not finalized, the DataNodes keep track of all the previous blocks (including blocks deleted after the upgrade) in case a "rollback" is needed.
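If you want to check this from the command line, a quick sketch; on Cloudera clusters the finalization itself is normally driven through the Cloudera Manager UI, so treat the second command with care:

```shell
# Query the status of a rolling upgrade, if one is in progress:
hdfs dfsadmin -rollingUpgrade query

# Finalize a classic (non-rolling) HDFS upgrade once you are confident
# no rollback is needed; this releases the retained old blocks:
hdfs dfsadmin -finalizeUpgrade
```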
05-31-2022
01:08 AM
Hi, the "hdfs dfs -du" for that path should return the summary of the disk usage (bytes, kbytes, megabytes, etc..) for that given path. Are you sure there are "no lines returned"? Have you checked the "du" output for a smaller subpath (which has less files underneith), does that return results? Can you also clarify where have you checked the block count before and after the deletion? ("the block count among data nodes did not decrease as expected")
05-30-2022
11:02 AM
Be careful with starting processes as the root user, as that may leave some files and directories around owned by root, and then the ordinary "yarn" user (the process started by CM) won't be able to write to them, for example the log files under /var/log/hadoop-yarn/... Please verify that.
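A minimal sketch for that verification; the log path comes from the post above, while the yarn:hadoop ownership is an assumption to adjust for your deployment:

```shell
# List anything under the YARN log directory still owned by root:
find /var/log/hadoop-yarn -user root -ls

# If such files exist, hand them back to the yarn user
# (yarn:hadoop is a typical, but not universal, owner/group pair):
chown -R yarn:hadoop /var/log/hadoop-yarn
```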
05-30-2022
10:37 AM
Hello @andrea_pretotto,

This typically happens if you have snapshots on the system. Even though the "current" files are deleted from HDFS, they may still be held by one or more snapshots (which is exactly what makes snapshots useful against accidental data deletion, as you can recover data from them if needed).

Please check which HDFS directories are snapshottable:

hdfs lsSnapshottableDir

and then check how many snapshots you have under those directories:

hdfs dfs -ls /snapshottable_path/.snapshot

You can probably also verify it by comparing the "du" output which includes the snapshots' sizes:

hdfs dfs -du -h -v -s /snapshottable_path

vs. the same which excludes the snapshots from the calculation:

hdfs dfs -du -x -h -v -s /snapshottable_path

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#du

Best regards, Miklos
Customer Operations Engineer, Cloudera
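If it turns out snapshots are holding the space and they are no longer needed, a short sketch for reclaiming it; the path and snapshot name are placeholders, and deletion is irreversible, so double-check first:

```shell
# List the existing snapshots first:
hdfs dfs -ls /snapshottable_path/.snapshot

# Delete a specific snapshot to release the blocks it holds:
hdfs dfs -deleteSnapshot /snapshottable_path snapshot_name
```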
05-30-2022
05:44 AM
Have you reviewed the classpath of the HS2 and all the jars?

$JAVA_HOME/bin/jinfo <hs2_pid> | grep java.class.path

Do they contain classes under the "org.apache.hadoop.hive.ql.ddl" package? The attached code does not work on my cluster (it is missing some Tez-related configs). What configuration does it require?
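A minimal sketch for scanning the jars for that package; the parcel path is a common CDP location but an assumption here, so adjust it to where your Hive jars live:

```shell
# Search every Hive jar on the node for classes in the ddl package
# (/opt/cloudera/parcels/CDH/jars is typical but not guaranteed):
for j in /opt/cloudera/parcels/CDH/jars/hive-*.jar; do
  unzip -l "$j" | grep -q 'org/apache/hadoop/hive/ql/ddl' && echo "found in $j"
done
```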