Member since: 11-04-2015
Posts: 261
Kudos Received: 44
Solutions: 33
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 9136 | 05-16-2024 03:10 AM |
|  | 4211 | 01-17-2024 01:07 AM |
|  | 3642 | 12-11-2023 02:10 AM |
|  | 7066 | 10-11-2023 08:42 AM |
|  | 4104 | 09-07-2023 01:08 AM |
02-24-2022
06:13 AM
One more item to add for a complete picture: Spark SQL does not directly support Hive ACID tables. For that, in CDP you can use the Hive Warehouse Connector (HWC); please see: https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
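A minimal sketch of reading an ACID table through HWC could look like the following (the table name is hypothetical, and it assumes the HWC jar and its configuration are already on the Spark classpath):

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of an existing SparkSession named `spark`
val hive = HiveWarehouseSession.session(spark).build()

// Read from a managed (ACID) Hive table via HiveServer2;
// "managed_acid_table" is a placeholder name
val df = hive.executeQuery("SELECT * FROM managed_acid_table")
df.show()
```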
02-23-2022
01:01 AM
1 Kudo
Hi @Rajeshhadoop , The Spark DataSource API has the following syntax:
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:...")...load()
Please see: https://spark.apache.org/docs/2.4.0/sql-data-sources-jdbc.html
The problem with using this approach with Hive or Impala is that, since the above may run on multiple executors, it could overwhelm and essentially DDoS the Hive/Impala service. As the documentation states, this is not a supported way of connecting from Spark to Hive/Impala.
However, you can still connect to Hive and Impala through a simple JDBC connection using "java.sql.DriverManager" or "java.sql.Connection". That, in contrast, runs on a single thread on the Spark driver side and creates a single connection to a HiveServer2 / Impala daemon instance. The throughput between the Spark driver and Hive/Impala is of course limited with this approach; please use it only for simple queries or for submitting DDL/DML statements. Please see https://www.cloudera.com/downloads/connectors/hive/jdbc.html and https://www.cloudera.com/downloads/connectors/impala/jdbc.html for the JDBC drivers and for examples.
Independently of the above, you can still access Hive tables' data through Spark SQL with
val df = spark.sql("select ... from ...")
which is the recommended way of accessing and manipulating Hive table data from Spark, as it is parallelized through the Spark executors. See the docs: https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html
I hope this clarifies it. Best regards, Miklos
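For illustration, a single driver-side JDBC connection could be sketched like this (the HiveServer2 URL, credentials, and DDL below are all placeholders, and the Hive JDBC driver is assumed to be on the classpath):

```scala
import java.sql.DriverManager

// One connection, opened on the Spark driver only - no executors involved.
// Host, port, user and password are hypothetical.
val url = "jdbc:hive2://hiveserver2-host:10000/default"
val conn = DriverManager.getConnection(url, "user", "password")
try {
  val stmt = conn.createStatement()
  // A simple DDL statement - the kind of lightweight work this path suits
  stmt.execute("CREATE TABLE IF NOT EXISTS t (id INT)")
  stmt.close()
} finally {
  conn.close()
}
```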
02-22-2022
04:57 AM
1 Kudo
Hi @Jmeks , Please check again how the table and those partitions were created ("describe formatted <table> partition <partspec>"), as the same commands still work for me even on HDP 3.1.0:
create table mdatetest (col1 string) partitioned by (`date` date) location '/tmp/md';
alter table mdatetest add partition (`date`="2022-02-22");
show partitions mdatetest;
+------------------+
| partition |
+------------------+
| date=2022-02-22 |
+------------------+
alter table mdatetest drop partition (`date`="2022-02-22");
02-21-2022
07:50 AM
Hi @Jmeks , Can you clarify which CDH/HDP/CDP version you have? What is the datatype of that "date" partitioning column? The mentioned syntax works for both string and date datatypes in CDH 6.x.
02-21-2022
02:38 AM
Hi @Jmeks , I assume the partitioning column for this table is literally named "date". Please note that "date" is a reserved word in Hive (see the docs for reference), so it is not recommended as a table or column identifier. If you really need to use it as an identifier, enclose it in backticks. For example:
ALTER TABLE customer_transaction_extract DROP IF EXISTS PARTITION (`date`="2019-11-13");
Hope this helps. Best regards, Miklos
02-08-2022
01:00 AM
1 Kudo
Hi, "Do you have any general information about the types of tasks that the driver performs after a job completes?" I do not have a comprehensive list of such tasks; this is just what we usually observe through support cases and slowness reports. Of course, the driver may perform completely different tasks as well: any custom Spark code that does not involve parallel execution or data processing runs on the driver side only, for example connecting to an external system through JDBC, or doing some local computation (not with RDDs or DataFrames).
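The driver-only vs. executor-side distinction can be sketched in a few lines (assuming an existing SparkSession named `spark`):

```scala
// Runs entirely on the driver: a plain Scala collection, no executors involved
val onDriver = (1 to 1000).map(_ * 2).sum

// Runs on the executors: the same computation expressed as an RDD
val onExecutors = spark.sparkContext.parallelize(1 to 1000).map(_ * 2).sum()
```

Time spent in code like the first line shows up as a "gap" between Spark jobs in the UI, since no job or stage is running while it executes.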
02-07-2022
11:14 AM
1 Kudo
Hi @CRowlett , As I understand it, within a single Spark streaming application you observe that the individual Spark jobs run quickly, but there are unaccounted delays between them. Whenever you see such symptoms, you need to check what the Spark driver is doing. The driver may not log every operation, but increased (DEBUG-level) logging may help you understand it. In general, after the Spark executors finish their jobs, you can expect the driver to additionally:
- commit files after new insert/save operations
- refresh the HDFS file list / file details of the table into which data has been inserted or saved
- alter the table: update Hive Metastore table partitions if the table is partitioned, or update statistics
The first two of these involve HDFS NameNode communication; the third involves HMS communication. To troubleshoot further, either enable DEBUG-level logging or collect "jstacks" from the Spark driver. The jstack is less intrusive, as you do not need to modify any job configuration, and from it you will be able to capture what the driver was doing while it was in the "hung" state. Based on your description, it seems to me that "too many files" / "small files" are causing the delays: the Spark driver has to refresh the file listing after inserts to make sure it still has the latest "catalog" information about the table, so it can reliably continue working with it. For Spark Streaming / "hot data" you may want to consider saving your data into HBase or Kudu instead; they are more suitable for those use cases. Best regards, Miklos
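Collecting a few thread dumps a few seconds apart could look like the sketch below (the PID is hypothetical; find the real driver PID e.g. with `ps -ef | grep SparkSubmit`):

```shell
# Hypothetical Spark driver PID
DRIVER_PID=12345

# Take a few thread dumps a few seconds apart to see where the driver spends time
for i in 1 2 3; do
  jstack "$DRIVER_PID" > "driver-jstack-$i.txt"
  sleep 5
done
```

Comparing the dumps shows which stack frames the driver stays in across samples, i.e. where it is "stuck".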
02-03-2022
01:25 AM
Hello @grlzz , As mentioned, to test the database connectivity you can use 3rd-party tools like SQuirreL: http://squirrel-sql.sourceforge.net/ The following page describes how to connect to Postgres with the SQuirreL SQL Client: https://www.cdata.com/kb/tech/postgresql-jdbc-squirrel-sql.rst Please check the Postgres DB version and the corresponding documentation; this is a good starting point for assembling the JDBC connection string: https://jdbc.postgresql.org/documentation/80/connect.html If the connection string works in SQuirreL, then use the same connection string in Sqoop too. Hope this helps. Best regards, Miklos
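Once the connection string is verified, the Sqoop invocation could be sketched like this (host, database, and user names below are placeholders):

```shell
# List the tables visible to this user - a quick end-to-end connectivity check
sqoop list-tables \
  --connect "jdbc:postgresql://dbhost:5432/mydb" \
  --username myuser \
  -P
```

`-P` prompts for the password interactively instead of putting it on the command line.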
01-27-2022
08:51 AM
Hi @nagacg , Usually this happens (note also the "intermittent" nature) when a BI tool like SAS connects to the Impala service (Impala coordinator daemons) through a load balancer. With the load balancer, the different requests are routed to different Impala coordinator daemons, and likely one of the coordinator daemons is in bad health. In this case not all operations fail, just some of them, as you have described. It is sometimes not obvious from the Cloudera Manager UI that an impalad is unhealthy; you need to verify the coordinators one by one by connecting to each of them directly with another tool, such as impala-shell (only the coordinators need to be verified this way). I would suggest involving Cloudera Support to assist you with this.
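Checking each coordinator directly, bypassing the load balancer, could look like this (the hostnames are placeholders; 21000 is impala-shell's default port):

```shell
# Run a trivial query against each coordinator in turn;
# a hung or failing daemon will stand out immediately
impala-shell -i coordinator-host-1:21000 -q "select 1"
impala-shell -i coordinator-host-2:21000 -q "select 1"
impala-shell -i coordinator-host-3:21000 -q "select 1"
```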
01-27-2022
08:37 AM
Thanks for collecting and attaching the jstack. Yes, it confirms that Sqoop is trying to connect to the database, but the database does not respond (the Postgres driver is blocked reading from the socket):
Thread 7658: (state = IN_NATIVE)
- java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @BCI=0 (Interpreted frame)
...
- org.postgresql.core.VisibleBufferedInputStream.readMore(int) @BCI=86, line=143 (Interpreted frame)
...
- org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(org.postgresql.core.PGStream, java.lang.String, java.lang.String, java.util.Properties, org.postgresql.core.Logger) @BCI=10, line=376 (Interpreted frame)
- org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(org.postgresql.util.HostSpec[], java.lang.String, java.lang.String, java.util.Properties, org.postgresql.core.Logger) @BCI=675, line=173 (Interpreted frame)
...
- org.postgresql.Driver.makeConnection(java.lang.String, java.util.Properties) @BCI=18, line=393 (Interpreted frame)
- org.postgresql.Driver.connect(java.lang.String, java.util.Properties) @BCI=165, line=267 (Interpreted frame)
- java.sql.DriverManager.getConnection(java.lang.String, java.util.Properties, java.lang.Class) @BCI=171, line=664 (Interpreted frame)
- java.sql.DriverManager.getConnection(java.lang.String, java.lang.String, java.lang.String) @BCI=37, line=247 (Interpreted frame)
- org.apache.sqoop.manager.SqlManager.makeConnection() @BCI=182, line=888 (Interpreted frame)
- org.apache.sqoop.manager.GenericJdbcManager.getConnection() @BCI=10, line=59 (Interpreted frame)
...
- org.apache.sqoop.Sqoop.runSqoop(org.apache.sqoop.Sqoop, java.lang.String[]) @BCI=12, line=187 (Interpreted frame)
Please verify with some 3rd-party tool that your connection string is correct, or ask your DBA for the correct JDBC connection string.
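One quick way to check that the database answers at all, outside of Sqoop, could be (host, database, and user names are placeholders):

```shell
# psql ships with the Postgres client packages
psql "host=dbhost port=5432 dbname=mydb user=myuser" -c "select 1"

# The JDBC URL Sqoop should then use has the form:
#   jdbc:postgresql://dbhost:5432/mydb
```

If psql also hangs at connect time, the problem is network/firewall or server-side configuration rather than Sqoop.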