Member since: 11-04-2015
Posts: 143
Kudos Received: 17
Solutions: 14
My Accepted Solutions
Views | Posted
---|---
84 | 04-13-2022 02:36 AM
190 | 03-30-2022 04:30 AM
321 | 02-24-2022 06:13 AM
368 | 02-22-2022 04:57 AM
141 | 02-07-2022 11:14 AM
05-19-2022
06:42 AM
Hi! Sorry, but this seems to be an R-specific usage problem that I cannot help with. What you can do is enable DEBUG/TRACE level logging on the ODBC driver side (please check the ODBC driver documentation for how to do it); maybe there you can find further clues.
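As a rough sketch - the exact file and key names depend on the driver version, so treat these as assumptions to verify against the driver's Install Guide - the Cloudera/Simba-style ODBC drivers usually take logging settings like this in their driver configuration file:
# Hypothetical snippet for the driver config (e.g. cloudera.impalaodbc.ini) - verify the file and key names in your driver's documentation
[Driver]
# 0 = logging off ... 6 = most verbose (trace-like)
LogLevel=6
# Directory where the driver writes its log files
LogPath=/tmp/odbc_driver_logs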
05-18-2022
05:34 AM
Hi @roshanbi , the query itself seems incomplete to me: I do not see where the alias "a" is defined in the a.SUB_SERVICE_CODE_V=b.SUB_SERVICE_CODE_V part. It is also not clear which part is the database name, which is a table, and whether any complex types are involved. Can you run a select on "cbs_cubes.TB_JDV_CBS_NEW" (assuming that is a "database.table")? Can you run a simple update on it? Are you using the latest Cloudera Impala JDBC driver version? Is the affected table a Kudu-backed table? Thanks, Miklos
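For comparison, a rough sketch of an Impala UPDATE with a join - Impala only supports UPDATE on Kudu tables, and the column and second table names below are placeholders, so please verify the exact syntax against the Impala UPDATE documentation:
-- every alias (such as "b") must be introduced in the FROM clause before it is referenced
UPDATE cbs_cubes.TB_JDV_CBS_NEW
SET some_col = b.some_col
FROM cbs_cubes.TB_JDV_CBS_NEW JOIN other_db.other_table b
  ON cbs_cubes.TB_JDV_CBS_NEW.SUB_SERVICE_CODE_V = b.SUB_SERVICE_CODE_V;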
05-03-2022
04:08 AM
Thanks for checking. Is the connection successful using other clients, like impala-shell, beeline and other JDBC clients?
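For example - the host, port, certificate and truststore paths below are placeholders, adjust them to your cluster:
# impala-shell over TLS (the protocol/port combination may differ depending on your impala-shell version)
impala-shell --protocol=hs2 --ssl --ca_cert=/path/to/ca.pem -i coordinator-host.example.com:21050 -q "select 1"
# beeline against the Impala HS2 port
beeline -u "jdbc:hive2://coordinator-host.example.com:21050/default;ssl=true;sslTrustStore=/path/to/truststore.jks;trustStorePassword=changeit" -e "select 1"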
05-02-2022
08:36 AM
Hi @gfragkos, thanks for checking. Let's step back then. Is the Impala service TLS/SSL enabled at all? Can you verify that with the openssl tools? For example:
echo | openssl s_client -connect cdp-tdh-de3-master0.cdp-tdh.u5te-1stu.cloudera.site:21050 -CAfile /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_cacerts.pem
04-28-2022
08:24 AM
Hello Gozde @gfragkos , Have you checked whether the connectivity works with the given sslTrustStore file from a Java-based client (for example beeline)? As I see, your application tries to use unixODBC to connect to a CDP / Impala service. However, from the shared connection details I see that the truststore is a Java keystore file (JKS), and since "nanodbc.cpp" is not a Java-based application, it probably cannot recognize that as a valid truststore file. Please try to use a "pem" format truststore file instead. Please also review the Impala ODBC Driver documentation: https://downloads.cloudera.com/connectors/impala_odbc_2.6.14.1016/Cloudera-ODBC-Connector-for-Impala-Install-Guide.pdf Thanks Miklos
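If you need to derive a PEM file from the existing JKS truststore, something along these lines should work - the alias and file names are assumptions, so list the actual aliases first:
# List the certificate aliases stored in the JKS truststore
keytool -list -keystore /path/to/truststore.jks
# Export one CA certificate in PEM (RFC) format - repeat per alias if needed
keytool -exportcert -rfc -alias my-ca-alias -keystore /path/to/truststore.jks -file ca-cert.pem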
04-27-2022
12:54 AM
Hi @jarededrake , that's a good track. The issue now seems to be that the cluster has Kerberos enabled, and that needs extra configuration. In the workflow editor, in the upper right corner of the Spark action you will find a cogwheel icon for advanced settings. There, on the Credentials tab, enable the "hcat" and "hbase" credentials to let the Spark client obtain delegation tokens for the Hive (Hive Metastore) and HBase services - in case the Spark application wants to use those services (Spark does not know this in advance, so it obtains those delegation tokens). You can also disable this behavior if you are sure that the Spark application will not connect to Hive (using Spark SQL) or HBase; just add the following to the Spark action option list:
--conf spark.security.credentials.hadoopfs.enabled=false
--conf spark.security.credentials.hbase.enabled=false
--conf spark.security.credentials.hive.enabled=false
However, it is easier to just enable these credentials on the settings page. For similar Kerberos-related issues in other actions, please see the following guide: https://gethue.com/hadoop-tutorial-oozie-workflow-credentials-with-a-hive-action-with-kerberos/
04-26-2022
05:09 AM
Hi @jarededrake , sorry for the delay, I was away for a couple of days. You should use your thin jar (application only - without the dependencies) from the target directory ("SparkTutorial-1.0-SNAPSHOT.jar"). The NoClassDefFoundError for SparkConf suggests that you have tried a Java action. It is highly recommended to use a Spark action in the Oozie workflow editor when running a Spark application, to make sure that the environment is set up properly for the application.
04-14-2022
09:16 AM
So is it "/tmp/kbr5cc_dffe" or "krb5cc_cldr"? Or where do you see the "KRB5CCNAME=/tmp/kbr5cc_dffe"? The "krb5cc_cldr" is used by all services (not sure about all of them, but all the ones I quickly verified had it) - we can say it is hardcoded. In any case it is "private" to the process itself: it holds the Kerberos ticket cache which only that process uses (and renews if needed).
04-14-2022
09:12 AM
I see. Have you verified that the built jar contains this package structure and these class names? Can you also show where the jar is uploaded and how it is referenced in the Oozie workflow? Thanks, Miklos
04-14-2022
07:42 AM
Hi, I'm doing well, thank you, I hope you're good too. That property usually points to a relative path which exists in the process directory: KRB5CCNAME='krb5cc_cldr'. If that's not the case, I would look into whether the root user's (or maybe the "cloudera-scm" user's) .bashrc file has overridden that KRB5CCNAME environment variable by any chance.
04-14-2022
01:45 AM
Hi @yagoaparecidoti , in general, the "supervisor.conf" in the process directory (actually the whole process directory) is prepared by the Cloudera Manager server before starting a process (the CM server sends the whole package of information, including config files, to the CM agent, which extracts it into a new process directory). The supervisor.conf file contains all the environment and command related information which the Supervisor daemon needs to start the process. There might be some default values taken from the cluster or from the service type. Do you have a specific question about it?
04-13-2022
02:36 AM
1 Kudo
Hi @Seaport , the "RegexSerDe" is in the contrib package, which is not officially supported, and as such you can use it in some parts of the platform, but the different components may not give you full support for it. I would recommend preprocessing the datafiles into a commonly consumable format (such as CSV) before ingesting them into the cluster. Alternatively, you can ingest the data into a table which has only a single (string) column, and then do the processing/validation/formatting/transformation while inserting it into a proper final table with the columns you need. During the insert you can still use "regex" or "substring" type functions / UDFs to extract the fields you need from the fixed-width records (from the single-column table). I hope this helps, Best regards, Miklos
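A minimal sketch of the single-column staging approach - the table names, column names and field positions below are made up for illustration:
-- staging table: one raw fixed-width line per row
CREATE TABLE staging_fixed_width (line STRING);
-- final table with proper columns
CREATE TABLE final_table (id STRING, name STRING, amount INT);
-- extract the fixed-width fields during the insert (positions/lengths are placeholders)
INSERT INTO final_table
SELECT substr(line, 1, 10),
       trim(substr(line, 11, 20)),
       CAST(trim(substr(line, 31, 8)) AS INT)
FROM staging_fixed_width;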
04-13-2022
02:03 AM
Hi @jarededrake , The "ClassNotFoundException: Class Hortonwork.SparkTutorial.Main not found" suggests that the Java program's main class package name has a typo in your workflow definition: "Hortonwork" should probably be "Hortonworks". Can you check that?
03-30-2022
04:30 AM
Hello @Jared , The "ClassNotFoundException" means the JVM responsible for running the code did not find one of the required Java classes which the code relies on. It's great that you have added those jars to your IntelliJ development environment, however that does not mean they will be available at runtime. One way would be to package all the dependencies into your jar, creating a so-called "fat jar"; however, that is not recommended, because your application would then not benefit from future bugfixes deployed in the cluster as it is upgraded/patched, and it would also carry the risk of the application failing after upgrades due to class conflicts. The best way is to set up the runtime environment to have the needed classes. Hue's Java editor actually creates a one-time Oozie workflow with a single Java action in it, however it does not give you much flexibility in customizing the parts of this workflow and the runtime environment, including which other jars need to be shipped with the code. Since your code relies on SparkConf, I assume it is actually a Spark-based application. It would be a better option to create an Oozie workflow (you can also start from Hue > Apps > Scheduler > change the Documents dropdown to Actions) with a Spark action. That sets up the whole classpath needed for running Spark apps, so you do not need to reference any Spark-related jars, just the jar with your custom code. Hope this helps. Best regards Miklos
03-28-2022
02:17 AM
Hello @Sayed016 , In general, the java.io.IOException: Filesystem closed message happens when the same or a different thread in the same JVM has called the "FileSystem.close()" method (see the JavaDoc) and something later tries to access the HDFS filesystem (in this case "EventLoggingListener.stop()" tries to access HDFS to flush the Spark event logs). FileSystem.close() should not be called by any custom code: there is a single shared instance of the FileSystem object in any given JVM, and closing it can cause failures for still-running frameworks like Spark. This suggests that the Spark application calls FileSystem.close() somewhere in its code. Please review the code and remove those calls. Hope that helps. Best regards, Miklos
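A minimal sketch of the pattern to look for in the application code (this is an illustration, not code taken from the affected job):
// problematic pattern: FileSystem.get() returns the JVM-wide cached instance, so close() affects every user of it
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val exists = fs.exists(new Path("/some/path"))
fs.close()  // remove this call - Spark (for example the event logging listener) still needs the shared instance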
03-25-2022
01:27 AM
Hi Rama, yes, you can configure that in the "Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini". OPSAPS-41615 is still open; in the future you can ask about its status through any of your account team contacts. If you don't know who those contacts are, please ask/clarify that through the already open support case. Best regards, Miklos
03-24-2022
01:29 AM
1 Kudo
Hello @ram76 , You can configure Hue to use the XFF header:
[desktop]
use_x_forwarded_host=true
See the hue.ini reference: https://github.com/cloudera/hue/blob/master/desktop/conf.dist/hue.ini If not already done, besides using an external load balancer (like F5 - to let the end users remember only a single Hue login URL), please consider adding the "Hue Load Balancer" role in CM > Hue service (which sets up an Apache httpd) to serve the static content. See the following for more: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/hue_use_add_lb.html#hue_use_add_lb Hope this helps. Best regards, Miklos
03-22-2022
02:22 AM
Hello @mhsyed , thanks for reporting this. I see a similar description on our Partner page too: https://www.cloudera.com/downloads/partner/intel.html It seems the link is broken because the "Intel-bigdata" GitHub organization no longer has the "mkl-wrappers-parcel-repo" repository: https://github.com/orgs/Intel-bigdata/repositories I have asked our respective teams to get in touch with Intel to fix this. Unfortunately I cannot offer any workaround in the meantime; we ask for your patience. Best regards Miklos Szurap Customer Operations Engineer, Cloudera
03-10-2022
12:42 AM
Hi @M129 , the error message is not very descriptive. Can you please check in the HiveMetaStore logs what the complete error message - and the reason for the failure - is? Thanks Miklos
02-24-2022
06:13 AM
One more item to add, to have a complete picture: Spark SQL does not directly support Hive ACID tables. For that, in CDP you can use the Hive Warehouse Connector (HWC); please see: https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
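A minimal usage sketch - it assumes the HWC jar and its configuration (HiveServer2 JDBC URL, metastore URI, etc.) are already set up for the Spark session, the database/table names are placeholders, and the API names should be verified against the HWC documentation for your CDP version:
// hypothetical sketch of reading a Hive ACID table through the Hive Warehouse Connector
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
val df = hive.executeQuery("SELECT * FROM acid_db.acid_table LIMIT 10")
df.show()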
02-23-2022
01:01 AM
1 Kudo
Hi @Rajeshhadoop , The Spark DataSource API has the following syntax:
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:...")...load()
Please see: https://spark.apache.org/docs/2.4.0/sql-data-sources-jdbc.html The problem with using this approach against Hive or Impala is that, since it may run on multiple executors, it could overwhelm and essentially DDoS the Hive / Impala service. As the documentation states, this is not a supported way of connecting from Spark to Hive/Impala. However, you should still be able to connect to Hive and Impala through a simple JDBC connection using "java.sql.DriverManager" or "java.sql.Connection". That, in contrast, runs only on a single thread on the Spark driver side and creates a single connection to a HiveServer2 / Impala daemon instance. The throughput between the Spark driver and Hive/Impala is of course limited with this approach; please use it for simple queries or for submitting DDL/DML statements. Please see https://www.cloudera.com/downloads/connectors/hive/jdbc.html and https://www.cloudera.com/downloads/connectors/impala/jdbc.html for the JDBC drivers and for examples. Independently of the above, you can still access Hive tables' data through Spark SQL with
val df = spark.sql("select ... from ...")
which is the recommended way of accessing and manipulating Hive table data from Spark, as it is parallelized across the Spark executors. See the docs: https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html I hope this clarifies it. Best regards Miklos
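A rough sketch of the driver-side JDBC approach - the URL, host, port and credentials are placeholders, so take the exact driver class and URL format from the Cloudera JDBC driver documentation linked above:
// runs on the Spark driver only: a single connection, suitable for small result sets and DDL/DML
import java.sql.DriverManager
val url = "jdbc:hive2://hs2-host.example.com:10000/default;ssl=true"  // placeholder URL
val conn = DriverManager.getConnection(url, "username", "password")
try {
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery("SHOW TABLES")
  while (rs.next()) println(rs.getString(1))
} finally {
  conn.close()
}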
02-22-2022
04:57 AM
1 Kudo
Hi @Jmeks , please check again how the table was created and how those partitions were created ("describe formatted <table> partition <partspec>"), as the same still works for me even on HDP 3.1.0:
create table mdatetest (col1 string) partitioned by (`date` date) location '/tmp/md';
alter table mdatetest add partition (`date`="2022-02-22");
show partitions mdatetest;
+------------------+
| partition |
+------------------+
| date=2022-02-22 |
+------------------+
alter table mdatetest drop partition (`date`="2022-02-22");
02-21-2022
07:50 AM
Hi @Jmeks , Can you clarify which CDH/HDP/CDP version you have? What is the datatype of that "date" partitioning column? The mentioned syntax works for both string and date datatypes in CDH 6.x.
02-21-2022
02:38 AM
Hi @Jmeks , I assume the partitioning column for this table is literally named "date". Please note that "date" is a reserved word in Hive (see the docs for reference), so it is not recommended to use it in table names and column identifiers. If you really need to use it as an identifier, then enclose it in backticks. For example:
ALTER TABLE customer_transaction_extract DROP IF EXISTS PARTITION (`date`="2019-11-13");
Hope this helps. Best regards, Miklos
02-08-2022
01:00 AM
1 Kudo
Hi, "Do you have any general information about the types of tasks that the driver performs after a job completes?" - I do not have a comprehensive list of such tasks; this is just what we usually observe through cases and slowness reports. Of course, there may be completely different tasks that the driver performs - any custom Spark code which does not involve parallel execution / data processing may run only on the Spark driver side, for example connecting to an external system through JDBC, or doing some computation (not with RDDs or DataFrames).
02-07-2022
11:14 AM
1 Kudo
Hi @CRowlett , As I understand it, you observe that within a single Spark streaming application the individual Spark jobs run quickly, however there are some unaccounted-for delays between them. Whenever you see such symptoms, you need to check what the Spark driver is doing. The driver may not log every operation it performs, however increased (DEBUG level) logging may help you understand it. In general, after the Spark executors finish their jobs, you can expect the driver to additionally do the following:
- committing files after new inserts / save operations
- refreshing the HDFS file list / file details of the table into which data has been inserted/saved
- altering the table - Hive Metastore table partitions if the table is partitioned - or updating statistics
The first two of the above involve HDFS NameNode communication, the third involves HMS communication. To troubleshoot further, either enable DEBUG level logging or collect "jstacks" from the Spark driver. The jstack is less intrusive, as you do not need to modify any job configuration, and from it you will be able to capture what the driver was doing while it was in the "hung" state. Based on your description, it seems to me that "too many files" / "small files" are causing the delays, as the Spark driver has to refresh the file listing after inserts to make sure it still has the latest "catalog" information about the table - to be able to reliably continue working with it. For Spark Streaming / "hot data" you may want to consider saving your data into HBase or Kudu instead, as they are more suitable for such use cases. Best regards Miklos
02-03-2022
01:25 AM
Hello @grlzz , As mentioned, to test the database connectivity you can use 3rd party tools, like SQuirreL: http://squirrel-sql.sourceforge.net/ The following page describes how to connect to Postgres with the SQuirreL SQL Client: https://www.cdata.com/kb/tech/postgresql-jdbc-squirrel-sql.rst Please check the Postgres DB version and the corresponding documentation; this can be a good starting point for assembling the JDBC connection string: https://jdbc.postgresql.org/documentation/80/connect.html If the connection string works in SQuirreL, then use the same connection string in Sqoop too. Hope this helps. Best regards, Miklos
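For example - the host, port, database name and user below are placeholders:
# a typical PostgreSQL JDBC URL: jdbc:postgresql://<host>:<port>/<database>
# quick Sqoop-side test of the same connection string (-P prompts for the password)
sqoop list-tables --connect "jdbc:postgresql://dbhost.example.com:5432/mydb" --username myuser -P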
01-27-2022
08:51 AM
Hi @nagacg , Usually this happens (note the "intermittent" nature) when a BI tool like SAS connects to the Impala service (Impala coordinator daemons) through a load balancer. With the load balancer, different requests are routed to different Impala coordinator daemons, and likely one of the coordinator daemons is in bad health. In that case not all operations fail - just some of them, as you've described. It is sometimes not obvious (from the Cloudera Manager UI) that an impalad is unhealthy; you need to verify all of them by connecting to them directly, one by one, with another tool such as impala-shell (only the coordinators need to be checked this way). I would suggest involving Cloudera Support to assist you with this.
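For example, something like the following run against each coordinator in turn can reveal the unhealthy one - the host and port are placeholders, and depending on your impala-shell version and cluster security you may also need --ssl, -k or a different port/protocol:
# run a trivial query directly against one coordinator at a time
impala-shell -i coordinator1.example.com:21000 -q "select 1"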
01-27-2022
08:37 AM
Thanks for collecting and attaching the jstack. Yes, it confirms that Sqoop is trying to connect to the database, however the database does not respond (the PostgreSQL driver is reading from the socket):
Thread 7658: (state = IN_NATIVE)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @BCI=0 (Interpreted frame)
...
- org.postgresql.core.VisibleBufferedInputStream.readMore(int) @BCI=86, line=143 (Interpreted frame)
...
- org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(org.postgresql.core.PGStream, java.lang.String, java.lang.String, java.util.Properties, org.postgresql.core.Logger) @BCI=10, line=376 (Interpreted frame)
- org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(org.postgresql.util.HostSpec[], java.lang.String, java.lang.String, java.util.Properties, org.postgresql.core.Logger) @BCI=675, line=173 (Interpreted frame)
...
- org.postgresql.Driver.makeConnection(java.lang.String, java.util.Properties) @BCI=18, line=393 (Interpreted frame)
- org.postgresql.Driver.connect(java.lang.String, java.util.Properties) @BCI=165, line=267 (Interpreted frame)
- java.sql.DriverManager.getConnection(java.lang.String, java.util.Properties, java.lang.Class) @BCI=171, line=664 (Interpreted frame)
- java.sql.DriverManager.getConnection(java.lang.String, java.lang.String, java.lang.String) @BCI=37, line=247 (Interpreted frame)
- org.apache.sqoop.manager.SqlManager.makeConnection() @BCI=182, line=888 (Interpreted frame)
- org.apache.sqoop.manager.GenericJdbcManager.getConnection() @BCI=10, line=59 (Interpreted frame)
...
- org.apache.sqoop.Sqoop.runSqoop(org.apache.sqoop.Sqoop, java.lang.String[]) @BCI=12, line=187 (Interpreted frame)
Please verify with some 3rd party tool that your connection string is correct - or ask your DBA what the correct JDBC connection string is.
01-27-2022
05:04 AM
Hi @grlzz , Have you verified the DB connection URL "jdbc:postgresql://[host]:[port]/[db]" - is it working outside of Sqoop (with any external JDBC based tool)? Have you used the same PostgreSQL driver which is supposed to be present under /var/lib/sqoop? Also, if the sqoop command is really "stuck", please check in another terminal window where it is stuck, using jstack:
1. Get the process id of the sqoop command:
ps -ef | grep sqoop
2. Collect the jstack output - as the same user the sqoop import is running as:
/usr/java/latest/bin/jstack $PID
This can help us understand what it is trying to do (for example, it is trying to connect to the database - but maybe the database is SSL enabled?)