Member since: 11-04-2015
Posts: 179
Kudos Received: 24
Solutions: 23
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 253 | 10-13-2022 08:25 AM |
| 665 | 10-13-2022 03:29 AM |
| 266 | 10-05-2022 01:44 AM |
| 573 | 08-22-2022 01:02 AM |
| 376 | 08-02-2022 01:16 AM |
01-19-2023
08:20 AM
Hi @StuartM , I know it's not a direct answer, but this requirement sounds more like a good fit for Kafka - which inherently supports the idea of "consumer offsets".
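For illustration, the committed offsets of a consumer group can be inspected with the stock Kafka CLI; the broker address and group name below are placeholders:
kafka-consumer-groups --bootstrap-server broker1:9092 --describe --group my-consumer-group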
10-13-2022
08:25 AM
Hi @MichaelPlet , yes, 7.1.7 SP1 is definitely a stable and long-term support release with lots of bugfixes over 7.1.7. Also check the additional cumulative hotfixes which were released on top of 7.1.7 SP1: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/runtime-release-notes/topics/chf-pvcb-sp1-overview.html#chf-pvcb-sp1-overview If you have security vulnerability questions, kindly raise them through a support case. Thank you, Miklos Szurap, Customer Operations Engineer, Cloudera
10-13-2022
06:13 AM
These are pretty long GC pauses; I assume they are from the HMS logs. With long GC pauses every operation will suffer and be slow, and eventually the SMON's request will time out. Kindly review the HMS heap size and consider increasing it until you get stable performance (without such GC pauses).
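If it helps, one quick way to watch the GC pressure live (a sketch, assuming the JDK tools are available on the HMS host and you know the HMS process ID) is:
jstat -gcutil <hms_pid> 5000
This prints heap occupancy and GC counts every 5 seconds.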
10-13-2022
03:29 AM
The Canary just tests whether the basic operations are working in the Hive Metastore. If it shows "unhealthy", it does not necessarily mean that jobs are failing because the Hive Metastore is not functioning (it may just be slow, for example); it is, however, a warning sign that something is not right. Please connect with beeline to the HiveServer2 and verify what is working and what is failing, then check the HiveServer2 and Hive Metastore logs. You can file a support case (where you can share much more detail) if this is an urgent issue.
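A minimal smoke test from beeline could look like this (the hostname is a placeholder; add your auth options on a secured cluster):
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -e "SHOW DATABASES;"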
10-13-2022
01:20 AM
Hi @hanumanth , I assume this is a CDH 6 cluster. Do you have Sentry enabled as well? Is this always happening, or only at certain times? Have you tested in beeline how long it takes to drop an example database? Does it also fail with a timeout? I guess it takes more than 60 seconds (that's the Service Monitor's default timeout), and since the default timeout from HS2 to HMS is 5 minutes, it actually succeeds. Thanks, Miklos
10-05-2022
01:44 AM
Hi @Jarinek , Yes, in CDH/CDP every service which depends on HDFS inherits the HDFS "auth-to-local rules" configuration; in CM, see "Additional Rules to Map Kerberos Principals to Short Names" in the HDFS configuration. Kafka does not need HDFS, which is why it has a separate configuration of its own. See the documentation on how to set it: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/security-kerberos-authentication/topics/cm-security-kerberos-authentication-auth-to-local-isolate.html Best regards, Miklos
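For illustration, a rule in that setting which maps any single-component principal like alice@EXAMPLE.COM to the short name "alice" (EXAMPLE.COM being a placeholder realm) looks like:
RULE:[1:$1@$0](.*@EXAMPLE\.COM)s/@.*//
DEFAULT
The same rule syntax goes into Kafka's own property per the linked documentation.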
08-22-2022
01:02 AM
Hi @Shaswat , Without fully reviewing what (else) may be the problem, "port=21000" is definitely not correct. Impala has two "frontend" ports to which clients can connect:
- Port 21000 is used only by "impala-shell".
- Port 21050 is used by all other client applications: JDBC, ODBC, Hue, and Python applications using Impyla - which is what the above example uses.
Please see the Impyla docs for more. Best regards, Miklos
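A minimal Impyla sketch with the correct port (the hostname is a placeholder; add auth/SSL parameters as your cluster requires):
from impala.dbapi import connect
conn = connect(host='coordinator-host.example.com', port=21050)  # 21050, not 21000
cur = conn.cursor()
cur.execute('SELECT version()')
print(cur.fetchall())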
08-02-2022
01:16 AM
Hi @Neel_Sharma , The message suggests that the query tried to read the table's datafiles as if the table were Parquet-based (Parquet may be the default table format in Hive in CDP - of course only if the format is not specified during creation). However, the table creation script you've shared suggests the table should be text (CSV) based. Can you please verify the table format with:
DESCRIBE FORMATTED GeoIP2_ISP_Blocks_IPv4;
Are you in the right database? For the second issue - how do you create the external tables from tab-delimited files? How are the files uploaded to HDFS? Thanks, Miklos
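For reference, a text-backed external table for tab-delimited files would be declared roughly like this (the column names and location below are only assumptions for illustration):
CREATE EXTERNAL TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING,
  isp STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/geoip2_isp_blocks_ipv4';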
06-24-2022
12:36 AM
1 Kudo
Hi, The "Requested array size exceeds VM limit" means that your code tries to instantiate an array which has more than 2^31-1 elements (~2 billion) which is the max size of an array in Java. You cannot solve this with adding more memory. You need to split the work between executors and not process data on a single JVM (Driver side).
06-23-2022
01:47 AM
1 Kudo
wholeTextFiles is also not a scalable solution. https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.SparkContext.wholeTextFiles.html "Small files are preferred, as each file will be loaded fully in memory."
06-22-2022
02:45 AM
Hi @Yosieam , Using the "collect" method is not recommended, as it gathers the data on the Spark driver side, so the whole dataset has to fit into the driver's memory. Please rewrite your code to avoid the "collect" method.
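A sketch of the idea in PySpark (the paths are placeholders): keep the dataset distributed and let the executors write out the result instead of collecting it:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/input")
# df.collect() would pull every row into the driver's memory;
# a distributed action keeps the work on the executors:
df.write.mode("overwrite").parquet("/tmp/output")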
06-22-2022
02:41 AM
Hi @shivam0408 , Can you clarify which CDH/HDP/CDP version you are using and what the datatype of the "DATETIME" column is? What is the desired end result of this command? To drop all the partitions?
06-17-2022
03:52 AM
Can you review the whole logfile? The above NPE may be just a side effect of an earlier failure.
06-17-2022
01:39 AM
Hi @Uday_Singh2022 , yes, Flume is not a supported component in CDP. You can find documentation on Flume on its official website: https://flume.apache.org/ Have you considered using CDF / NiFi for this use case? https://docs.cloudera.com/cdf-datahub/latest/nifi-hbase-ingest/topics/cdf-datahub-nifi-hbase-ingest.html Thanks, Miklos
06-17-2022
01:34 AM
Hi @PCP2 , can you clarify which HDP/CDH/CDP version you are using? Is this a one-off, an intermittent issue, or does it always happen? Is it affecting only a single job? What kind of action is Oozie trying to launch? Thanks, Miklos
06-10-2022
12:23 AM
Hi @luckes , Please check whether your source code file (test.java) has UTF-8 encoding and how you are compiling the class (for example, when using Maven you may need to specify UTF-8 encoding for compiling the classes). These special characters are easily lost if the encoding is not set properly somewhere. Alternatively, you can use the unicode notation \uXXXX to make sure the character is properly understood by Java. For example, 张 is https://www.compart.com/en/unicode/U+5F20 , so in source code it looks like:
statement.setString(2, "\u5f20\u4e09");
Of course, it is rare that one needs to hardcode special characters in source code; usually they are read from a datafile - where you can specify which encoding to use during reading.
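For example, when compiling directly with javac the source encoding can be pinned like this (Maven has the equivalent project.build.sourceEncoding property):
javac -encoding UTF-8 test.java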
06-09-2022
03:35 AM
1 Kudo
Hi @DataMan-HJ , the case-insensitive join behavior you're looking for doesn't seem to be present in Hive and likely will not be implemented, as Hive relies on Java's UTF-8 strings and the behavior that implicitly comes with them, with no way to change the collation. There's a good discussion on HIVE-4070, where a similar ask was raised for the LIKE operator behavior; you can review the pros and cons there. So you will likely need to change the individual joins to use the lower/upper functions. Best regards, Miklos
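The rewrite is mechanical; a sketch with placeholder table and column names:
SELECT a.*, b.*
FROM orders a
JOIN customers b
  ON lower(a.customer_name) = lower(b.customer_name);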
06-09-2022
03:25 AM
Hi @luckes , thanks for reporting this. Based on your description, yes, it seems "upsert" is replaced with "insert" everywhere by the driver. Please open a support case through the MyCloudera support portal to have this routed to the proper team as an enhancement request. Other ideas:
- Have you checked whether this behavior can be observed with the latest JDBC driver version too?
- Please check whether "UseNativeQuery=1" in the JDBC connection string helps.
- Does it work if you avoid "insert" in the column name ("insert_time"), for example with a "modification_time" column name?
Thank you, Miklos Szurap, Customer Operations Engineer, Cloudera
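For the second idea, the property goes into the connection string; a sketch with a placeholder host (the exact URL prefix and the other properties depend on the driver in use):
jdbc:impala://coordinator-host.example.com:21050;UseNativeQuery=1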
06-08-2022
04:08 AM
1 Kudo
Hi Andrea, Great to see that it has been found now and thanks for marking the post as answered. All the best, Miklos
06-08-2022
01:10 AM
Hi @tallamohan The direct usage of the Hive classes (CliSessionState, SessionState, Driver) in the provided code falls under the "Hive CLI" or "Hcat CLI" access, which is not supported in CDP: https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade/topics/hive-unsupported.html Please open a case on MyCloudera Support Portal to get that clarified. The recommended approach would be to use beeline and access the Hive service through HiveServer2. Best regards Miklos
06-08-2022
01:04 AM
1 Kudo
Please remember that one block is not necessarily 256 MB; it can be less. Also, not all files have a replication factor of 3; some might have only one replica, so it can be totally fine if some of those were single-replica files. 600,000 * 256 MB = 153.6 TB as a maximum, but since blocks can be smaller than 256 MB, the 60 TB freed up is reasonable.
06-08-2022
12:44 AM
Please check which CDH version the cluster has. The Cloudera ODBC driver version 2.5.x is in general compatible with CDH 5.x; for CDH 6.x please use the latest 2.6.x version - see the Downloads section on our website. To further triage this:
- Try to connect directly to a specific Impala coordinator host instead of the load balancer - if a load balancer is used.
- Enable driver-side logging (check the driver's "Installation Guide" for how to enable it), which can give further clues.
- Cross-check that the Impala service is indeed SSL-enabled, using "openssl" commands to verify the certificate presented by the service, including the truststore used on the client side.
Hope this helps, Miklos Szurap, Customer Operations Engineer, Cloudera
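For the SSL cross-check, something along these lines (host and port are placeholders) shows the certificate chain the service actually presents:
openssl s_client -connect impala-host.example.com:21050 -showcerts </dev/null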
06-03-2022
02:39 AM
1 Kudo
Hi @Amn_468 , The lock contention happens when there are too many "invalidate metadata" (IM) and "refresh" commands running. The catalog daemon's responsibility is to load the Hive Metastore metadata (table and partition information, including stats) and the HDFS metadata (the list of files and their block locations). If a table is refreshed (or loaded for the first time after an IM), catalogd has to load this metadata, and it has built-in limits on how many tables, partitions, and files it can load at a time. While doing so it holds a lock on the "catalog update" to prevent simultaneous requests from overwriting the previously collected information. So if there are concurrent, long-running "refresh" statements [1], they can block each other and delay the publishing of the catalog information. What can be done:
- Reduce the number of IM calls.
- Reduce the number of refresh calls.
- Wherever possible, refresh on the partition level only (see the example below).
- There were improvements in IMPALA-6671, available in CDP 7.1.7 SP1, so an upgrade could also help (it still cannot completely compensate for high-frequency, heavy refreshes).
I hope this helps the discussions with the users/teams about how frequently and when they submit the refresh queries. Miklos, Customer Operations Engineer, Cloudera [1] https://impala.apache.org/docs/build3x/html/topics/impala_refresh.html
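For example, a partition-level refresh (the table and partition spec below are placeholders) touches only the affected metadata instead of the whole table:
REFRESH my_db.my_table PARTITION (dt='2022-06-01');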
06-03-2022
12:48 AM
That is great, thank you for sharing the solution! Best regards Miklos
06-01-2022
03:21 AM
The DN should only keep files which are still managed and known by the NN. After a huge deletion event these "pending deletes" may of course take some time to be sent to the DNs (and for the DNs to delete them), but usually that does not take long. Maybe check the "select pending_deletion_blocks" chart if this is applicable. If the above does not apply, then check it more deeply (see the sketch below):
- Collect a full "hdfs fsck -files -blocks -locations" output.
- Pick a DN which you think has more blocks than it should.
- Verify how many blocks the hdfs fsck report shows for that DN.
- Verify on the DN side how many files it is storing.
- Do those numbers match?
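A sketch of the first two checks (the hostname is a placeholder, and the grep is only a rough per-DN count, since fsck prints one location entry per replica):
hdfs fsck / -files -blocks -locations > fsck_report.txt
grep -c 'datanode-01.example.com' fsck_report.txt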
05-31-2022
07:33 AM
Hi Andrea, Oh, I see, I did not consider that you are seeing this from the DataNodes' perspective. Was this cluster recently upgraded? Is the "Finalize Upgrade" step for HDFS still pending? https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdp/topics/ug_cdh_upgrade_hdfs_finalize.html While the HDFS upgrade is not finalized, DataNodes keep track of all the previous blocks (including blocks deleted after the upgrade) in case a "rollback" is needed.
05-31-2022
01:08 AM
Hi, the "hdfs dfs -du" for that path should return the summary of the disk usage (bytes, kbytes, megabytes, etc..) for that given path. Are you sure there are "no lines returned"? Have you checked the "du" output for a smaller subpath (which has less files underneith), does that return results? Can you also clarify where have you checked the block count before and after the deletion? (" the block count among data nodes did not decrease as expected")
05-30-2022
11:02 AM
Be careful with starting processes as the root user, as that may leave some files and directories owned by root - and then the ordinary "yarn" user (which the process started by CM runs as) won't be able to write to them. For example, log files under /var/log/hadoop-yarn/... Please verify that.
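For example, to hand a root-owned log directory back to the service user (verify the actual owner and group your cluster uses before running; "yarn:yarn" here is an assumption):
chown -R yarn:yarn /var/log/hadoop-yarn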
05-30-2022
10:37 AM
Hello @andrea_pretotto , This typically happens if you have snapshots on the system. Even though the "current" files are deleted from HDFS, they may still be held by one or more snapshots (which is exactly what makes snapshots useful against accidental data deletion: you can recover data from them if needed). Please check which HDFS directories are snapshottable:
hdfs lsSnapshottableDir
and then check how many snapshots you have under those directories:
hdfs dfs -ls /snapshottable_path/.snapshot
You can probably also verify it by comparing the output of "du" including the snapshots' sizes:
hdfs dfs -du -h -v -s /snapshottable_path
with the same command excluding the snapshots from the calculation:
hdfs dfs -du -x -h -v -s /snapshottable_path
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#du
Best regards, Miklos, Customer Operations Engineer, Cloudera
05-30-2022
05:44 AM
Have you reviewed the classpath of the HS2 and all the jars?
$JAVA_HOME/bin/jinfo <hs2_pid> | grep java.class.path
Do they include classes under the "org.apache.hadoop.hive.ql.ddl" package? The attached code does not work on my cluster (it is missing some Tez-related configs). What configuration does it require?