Member since: 11-04-2015
Posts: 261
Kudos Received: 44
Solutions: 33
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 9127 | 05-16-2024 03:10 AM |
|  | 4208 | 01-17-2024 01:07 AM |
|  | 3641 | 12-11-2023 02:10 AM |
|  | 7060 | 10-11-2023 08:42 AM |
|  | 4094 | 09-07-2023 01:08 AM |
06-08-2022
01:10 AM
Hi @tallamohan,

The direct usage of the Hive classes (CliSessionState, SessionState, Driver) in the provided code falls under "Hive CLI" / "HCat CLI" access, which is not supported in CDP: https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade/topics/hive-unsupported.html

Please open a case on the MyCloudera Support Portal to get that clarified. The recommended approach is to use Beeline and access the Hive service through HiveServer2.

Best regards, Miklos
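For reference, a minimal Beeline connection sketch; the host name, port, and Kerberos principal below are placeholders to adjust to your HiveServer2 setup:

```shell
# Connect to HiveServer2 via Beeline on a Kerberos-secured cluster
# (hs2-host.example.com and the principal are hypothetical values):
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"

# On a non-secure cluster, a plain connection with a user name:
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n hive
```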
06-08-2022
01:04 AM
1 Kudo
Please remember that one block is not necessarily 256 MB; it can be smaller. Also, not all files have a replication factor of 3; some might have only a single replica, so the numbers can be entirely consistent if those were single-replica files. 600,000 * 256 MB = 153.6 TB is the theoretical maximum; since blocks can be smaller than 256 MB, the 60 TB freed up is reasonable.
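If you want to verify the actual average block size on your cluster, a quick sketch; the fsck summary prints an average block size figure, though the exact wording can vary between Hadoop versions:

```shell
# Run as the HDFS superuser; the summary section of the fsck report
# includes the total block count and the average block size.
hdfs fsck / | grep -i 'avg. block size'
```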
06-03-2022
02:39 AM
2 Kudos
Hi @Amn_468,

The lock contention happens when too many "invalidate metadata" (IM) and "refresh" commands are running. The catalog daemon's responsibility is to load the Hive Metastore metadata (table and partition information, including stats) and the HDFS metadata (the list of files and their block locations). When a table is refreshed (or loaded for the first time after an IM), catalogd has to load this metadata, and it has built-in limits on how many tables, partitions, and files it can load at a time. While doing so it needs to hold a lock on the "catalog update" to avoid simultaneous requests overwriting the previously collected information. So concurrent, long-running "refresh" statements [1] can block each other and delay the publishing of the catalog information.

What can be done:
- reduce the number of IM calls
- reduce the number of refresh calls
- wherever possible, refresh on the partition level only (see the sketch below)
- upgrade: there were improvements in IMPALA-6671, available from CDP 7.1.7 SP1, so an upgrade could also help (though it still cannot fully compensate for high-frequency, heavy refreshes)

I hope this helps the discussions with the users/teams about how frequently and when they submit the refresh queries.

Miklos
Customer Operations Engineer, Cloudera

[1] https://impala.apache.org/docs/build3x/html/topics/impala_refresh.html
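As an illustration of partition-level refreshes, a minimal sketch; the database, table, and partition column names are hypothetical:

```shell
# Refresh only one partition instead of the whole table
# (Impala's REFRESH supports a PARTITION clause):
impala-shell -q "REFRESH sales_db.orders PARTITION (year=2022, month=6)"

# For comparison, a full-table refresh reloads the file metadata
# of every partition and is much heavier on catalogd:
impala-shell -q "REFRESH sales_db.orders"
```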
06-03-2022
12:48 AM
That is great, thank you for sharing the solution! Best regards, Miklos
06-01-2022
03:21 AM
The DN should keep only the files which are still managed and known by the NN. After a huge deletion event these "pending deletes" may of course take some time to be sent to the DNs (and for the DNs to delete them), but usually that does not take long. Check the "pending deletion blocks" chart (tsquery: SELECT pending_deletion_blocks) if this is applicable.

If the above is not applicable, then check it more deeply, as in the sketch below:
- collect a full "hdfs fsck / -files -blocks -locations" output
- pick a DN which you think has more blocks than it should
- verify how many blocks the hdfs fsck report attributes to that DN
- verify on the DN side how many block files it is actually storing
- do those numbers match?
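A minimal sketch of that comparison; the DataNode address and the data directory (/dfs/dn) are placeholders for your actual values:

```shell
# 1. Collect the full block report (run as the HDFS superuser).
hdfs fsck / -files -blocks -locations > /tmp/fsck_report.txt

# 2. Count how many block replicas fsck places on the chosen DataNode
#    (10.0.0.5:9866 is a hypothetical DataNode transfer address).
grep -c '10\.0\.0\.5:9866' /tmp/fsck_report.txt

# 3. On the DataNode host, count the actual block files on disk
#    (adjust /dfs/dn to your configured dfs.datanode.data.dir).
find /dfs/dn -name 'blk_*' ! -name '*.meta' | wc -l
```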
05-31-2022
07:33 AM
Hi Andrea,

Oh, I see, I did not consider that you see this from the DataNodes' perspective. Was this cluster recently upgraded? Is the "Finalize Upgrade" step for HDFS still pending? https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdp/topics/ug_cdh_upgrade_hdfs_finalize.html

While the HDFS upgrade is not finalized, the DataNodes keep track of all the previous blocks (including blocks deleted after the upgrade) in case a "rollback" is needed.
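If you want to check this from the command line, a quick sketch; on Cloudera clusters the finalization itself is normally driven through the Cloudera Manager UI, so treat the second command with care:

```shell
# Query the status of a rolling upgrade, if one is in progress:
hdfs dfsadmin -rollingUpgrade query

# Finalize a classic (non-rolling) HDFS upgrade once you are confident
# no rollback is needed; this releases the retained old blocks:
hdfs dfsadmin -finalizeUpgrade
```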
05-31-2022
01:08 AM
Hi, the "hdfs dfs -du" for that path should return the summary of the disk usage (bytes, kbytes, megabytes, etc..) for that given path. Are you sure there are "no lines returned"? Have you checked the "du" output for a smaller subpath (which has less files underneith), does that return results? Can you also clarify where have you checked the block count before and after the deletion? ("the block count among data nodes did not decrease as expected")
05-30-2022
11:02 AM
Be careful with starting processes as the root user, as that may leave some files and directories around owned by root, and then the ordinary "yarn" user (the process started by CM) won't be able to write to them, for example the log files under /var/log/hadoop-yarn/... Please verify that.
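A minimal sketch for that verification; the log path comes from the post above, while the yarn:hadoop ownership is an assumption to adjust for your deployment:

```shell
# List anything under the YARN log directory still owned by root:
find /var/log/hadoop-yarn -user root -ls

# If such files exist, hand them back to the yarn user
# (yarn:hadoop is a typical, but not universal, owner/group pair):
chown -R yarn:hadoop /var/log/hadoop-yarn
```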
05-30-2022
10:37 AM
Hello @andrea_pretotto,

This typically happens if you have snapshots on the system. Even though the "current" files are deleted from HDFS, they may still be held by one or more snapshots (which is exactly what makes snapshots useful against accidental data deletion, as you can recover data from them if needed).

Please check which HDFS directories are snapshottable:

hdfs lsSnapshottableDir

and then check how many snapshots you have under those directories:

hdfs dfs -ls /snapshottable_path/.snapshot

You can probably also verify it by comparing the "du" output which includes the snapshots' sizes:

hdfs dfs -du -h -v -s /snapshottable_path

vs. the same which excludes the snapshots from the calculation:

hdfs dfs -du -x -h -v -s /snapshottable_path

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#du

Best regards, Miklos
Customer Operations Engineer, Cloudera
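If it turns out snapshots are holding the space and they are no longer needed, a short sketch for reclaiming it; the path and snapshot name are placeholders, and deletion is irreversible, so double-check first:

```shell
# List the existing snapshots first:
hdfs dfs -ls /snapshottable_path/.snapshot

# Delete a specific snapshot to release the blocks it holds:
hdfs dfs -deleteSnapshot /snapshottable_path snapshot_name
```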
05-30-2022
05:44 AM
Have you reviewed the classpath of the HS2 and all the jars?

$JAVA_HOME/bin/jinfo <hs2_pid> | grep java.class.path

Do they contain classes under the "org.apache.hadoop.hive.ql.ddl" package? The attached code does not work on my cluster (it is missing some Tez-related configs). What configuration does it require?
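A minimal sketch for scanning the jars for that package; the parcel path is a common CDP location but an assumption here, so adjust it to where your Hive jars live:

```shell
# Search every Hive jar on the node for classes in the ddl package
# (/opt/cloudera/parcels/CDH/jars is typical but not guaranteed):
for j in /opt/cloudera/parcels/CDH/jars/hive-*.jar; do
  unzip -l "$j" | grep -q 'org/apache/hadoop/hive/ql/ddl' && echo "found in $j"
done
```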