Member since: 11-04-2015
Posts: 260
Kudos Received: 44
Solutions: 33
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2509 | 05-16-2024 03:10 AM |
 | 1531 | 01-17-2024 01:07 AM |
 | 1551 | 12-11-2023 02:10 AM |
 | 2289 | 10-11-2023 08:42 AM |
 | 1572 | 09-07-2023 01:08 AM |
02-21-2022
02:38 AM
Hi @Jmeks ,
I assume the partitioning column for this table is literally named "date". Please note that "date" is a reserved word in Hive (see the docs for reference), so it is not recommended to use it in table or column identifiers. If you really need to use it as an identifier, enclose it in backticks. For example:
ALTER TABLE customer_transaction_extract DROP IF EXISTS PARTITION (`date`="2019-11-13");
Hope this helps.
Best regards, Miklos
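P.S. The same backtick quoting applies to any other statement touching that column. A minimal sketch (the table name is taken from your example; the string-typed partition value is an assumption):
SHOW PARTITIONS customer_transaction_extract;
SELECT * FROM customer_transaction_extract WHERE `date` = '2019-11-13' LIMIT 10;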
02-08-2022
01:00 AM
1 Kudo
Hi,
"Do you have any general information about the types of tasks that the driver performs after a job completes?"
I do not have a comprehensive list of such tasks; this is just what we usually observe through cases and slowness reports. Of course, the driver may perform completely different tasks: any custom Spark code that does not involve parallel execution / data processing may run only on the Spark driver side, for example connecting to an external system through JDBC, or doing some computation (not with RDDs or DataFrames).
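As an illustration, a minimal Scala sketch (assuming a spark-shell session where `spark` is predefined; the path, JDBC URL and credentials are made-up placeholders) of code that looks like part of a Spark job but actually runs only in the driver process:
// Runs distributed on the executors:
val df = spark.read.parquet("/data/events")
println(df.count())
// Runs only on the driver - a plain JDBC call, no RDDs/DataFrames involved:
import java.sql.DriverManager
val conn = DriverManager.getConnection("jdbc:postgresql://dbhost.example.com:5432/mydb", "myuser", "secret")
val rs = conn.createStatement().executeQuery("SELECT max(id) FROM some_table")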
02-07-2022
11:14 AM
1 Kudo
Hi @CRowlett ,
As I understand it, within a single Spark streaming application you observe that the individual Spark jobs run quickly, yet there are unaccounted delays between them. Whenever you see such symptoms, you need to check what the Spark driver is doing. The driver may not log every operation it performs, but increased (DEBUG level) logging may help you understand it. In general, after the Spark executors finish their jobs, you can expect the driver to additionally:
- commit files after new inserts / save operations,
- refresh the HDFS file list / file details of the table into which data has been inserted/saved,
- alter the Hive Metastore table partitions if the table is partitioned, or update statistics.
The first two of the above involve HDFS NameNode communication, the third involves HMS communication. To troubleshoot further, either enable DEBUG level logging or collect "jstacks" from the Spark driver (see the sketch at the end of this post). The jstack is less intrusive, as you do not need to modify any job configuration, and from it you will be able to capture what the driver was doing while it was in the "hung" state. Based on your description, it seems to me that the "too many files" / "small files" problem is causing the delays: the Spark driver has to refresh the file listing after inserts to make sure it still has the latest "catalog" information about the table, to be able to reliably continue working with it. For Spark Streaming / "hot data" you may want to consider saving your data into HBase or Kudu instead; they are more suitable for those use cases.
Best regards
Miklos
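P.S. A minimal sketch of collecting a few jstacks from the driver while the application appears idle (the application name, jstack path and output location are made-up placeholders; run it as the same user the driver runs as):
DRIVER_PID=$(pgrep -f 'my_streaming_app' | head -1)
for i in 1 2 3 4 5; do /usr/java/latest/bin/jstack $DRIVER_PID > /tmp/driver-jstack-$i.txt; sleep 10; done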
02-03-2022
01:25 AM
Hello @grlzz ,
As mentioned, to test the database connectivity you can use 3rd-party tools, like SQuirreL: http://squirrel-sql.sourceforge.net/
The following page describes how to connect to Postgres with the SQuirreL SQL Client: https://www.cdata.com/kb/tech/postgresql-jdbc-squirrel-sql.rst
Please check the Postgres DB version and the corresponding documentation; this can be a good starting point for assembling the JDBC connection string: https://jdbc.postgresql.org/documentation/80/connect.html
If the connection string works in SQuirreL, then use the same connection string in Sqoop too (see the example below).
Hope this helps.
Best regards, Miklos
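P.S. For illustration, a typical PostgreSQL JDBC URL and the matching Sqoop invocation look like this (host, port, database, user and table are made-up placeholders):
jdbc:postgresql://dbhost.example.com:5432/mydb
sqoop import --connect "jdbc:postgresql://dbhost.example.com:5432/mydb" --username myuser -P --table mytable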
01-27-2022
08:51 AM
Hi @nagacg ,
Usually this happens (emphasis also on the "intermittent" nature) when BI tools like SAS connect to the Impala service (the Impala coordinator daemons) through a load balancer. With the load balancer, the different requests are routed to different Impala coordinator daemons, and likely one of them is in bad health. In this case not all operations fail - just some of them, as you have described. It is sometimes not obvious (from the Cloudera Manager UI) that an impalad is unhealthy; you need to verify all of them by connecting directly to them one by one with another tool, like impala-shell (only the coordinators need to be verified this way - see the sketch below). I would suggest involving Cloudera Support to assist you with this.
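For the direct check, a minimal sketch, assuming a Kerberized and TLS-enabled cluster and made-up coordinator hostnames (drop -k and --ssl on an unsecured cluster):
impala-shell -k --ssl -i coordinator1.example.com:21000 -q "select 1"
impala-shell -k --ssl -i coordinator2.example.com:21000 -q "select 1"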
01-27-2022
08:37 AM
Thanks for collecting and attaching the jstack. Yes, it confirms that Sqoop is trying to connect to the database, but the database does not respond (the PostgreSQL driver is blocked reading from the socket): Thread 7658: (state = IN_NATIVE)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @BCI=0 (Interpreted frame)
...
- org.postgresql.core.VisibleBufferedInputStream.readMore(int) @BCI=86, line=143 (Interpreted frame)
...
- org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(org.postgresql.core.PGStream, java.lang.String, java.lang.String, java.util.Properties, org.postgresql.core.Logger) @BCI=10, line=376 (Interpreted frame)
- org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(org.postgresql.util.HostSpec[], java.lang.String, java.lang.String, java.util.Properties, org.postgresql.core.Logger) @BCI=675, line=173 (Interpreted frame)
...
- org.postgresql.Driver.makeConnection(java.lang.String, java.util.Properties) @BCI=18, line=393 (Interpreted frame)
- org.postgresql.Driver.connect(java.lang.String, java.util.Properties) @BCI=165, line=267 (Interpreted frame)
- java.sql.DriverManager.getConnection(java.lang.String, java.util.Properties, java.lang.Class) @BCI=171, line=664 (Interpreted frame)
- java.sql.DriverManager.getConnection(java.lang.String, java.lang.String, java.lang.String) @BCI=37, line=247 (Interpreted frame)
- org.apache.sqoop.manager.SqlManager.makeConnection() @BCI=182, line=888 (Interpreted frame)
- org.apache.sqoop.manager.GenericJdbcManager.getConnection() @BCI=10, line=59 (Interpreted frame)
...
- org.apache.sqoop.Sqoop.runSqoop(org.apache.sqoop.Sqoop, java.lang.String[]) @BCI=12, line=187 (Interpreted frame)
Please verify that your connection string is correct with a 3rd-party tool, or ask your DBA for the correct JDBC connection string.
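For a quick check from the same host, the stock psql client works as well (hostname, port, database and user are made-up placeholders):
psql -h dbhost.example.com -p 5432 -U myuser -d mydb
If this also hangs, the problem is on the network/database side rather than in Sqoop.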
01-27-2022
05:04 AM
Hi @grlzz ,
Have you verified the DB connection URL "jdbc:postgresql://[host]:[port]/[db]" - does it work outside of Sqoop (with any external JDBC-based tool)?
Have you used the same PostgreSQL driver that is supposed to be present under /var/lib/sqoop?
Also, if the sqoop command is really "stuck", please check in another terminal window where it is stuck with jstack:
1. Get the process id of the sqoop command: ps -ef | grep sqoop
2. Collect the jstack output - as the same user the sqoop import is running as: /usr/java/latest/bin/jstack $PID
This can help to understand what it is trying to do (for example, it may be trying to connect to the database - but maybe the database is SSL-enabled?)
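Put together, a minimal sketch of the two steps (assuming a single sqoop process on the host; the output path is arbitrary):
PID=$(pgrep -f 'sqoop import' | head -1)
/usr/java/latest/bin/jstack $PID > /tmp/sqoop-jstack.txt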
01-27-2022
04:57 AM
Hi @DataMo ,
I'm not an expert in this area, but "Kerberos authentication" through a browser (SPNEGO) is a bit more complex than sending a username/password pair (HTTP Basic authentication). As far as I can see, https://issues.apache.org/jira/browse/NIFI-6250 - "Add Kerberos authentication support to InvokeHTTP Processor" - is still open.
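To illustrate the difference outside of NiFi with curl (the endpoint URL, user and principal are made-up placeholders):
# HTTP Basic - a single username/password pair:
curl -u myuser:mypassword http://service.example.com/endpoint
# SPNEGO - requires a Kerberos ticket first, curl then negotiates:
kinit myuser@EXAMPLE.COM
curl --negotiate -u : http://service.example.com/endpoint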
01-11-2022
01:21 AM
Hi @Kamil_Valko ,
The error message suggests that you are hitting IMPALA-9136, which is fixed in CDH 6.3.4 and in CDP 7.1.0 and newer. The recurring Catalogd restarts are likely due to some other problem; please file a support case so it can be reviewed in more detail. You can also review the role logs or the stdout/stderr logs of the Catalog daemon - they may show an OutOfMemoryError, which is common if the HMS / catalog metadata has grown recently and the catalog heap has not been adjusted.
Best regards
Miklos
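P.S. A quick way to check for that on the Catalog Server host, assuming the default log directory (the path may differ in your deployment):
grep -i "OutOfMemoryError" /var/log/catalogd/*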
01-06-2022
04:49 AM
1 Kudo
Hi @Donal_RC ,
Please check whether the Spark History Server UI is TLS/SSL enabled or not. I suppose it is not; usually the SHS WebUI is available only over http (the log snippets suggest plain http). Chrome automatically switches to https for some company domains; this feature is called HSTS - HTTP Strict Transport Security. Please check the address bar in Chrome - you should see that the URL starts with https. Internet Explorer does not do this, so you can try there. Alternatively you can:
- modify the URL and access the SHS with the host's IP address,
- or enable TLS/SSL for the SHS too.
Hope this helps,
Miklos
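P.S. In Chrome you can also inspect (and, for non-preloaded entries, delete) the cached HSTS policy for a given domain on the chrome://net-internals/#hsts page - querying the SHS hostname there can confirm whether HSTS is what forces the https redirect.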