Support Questions

Rajeshhadoop · ‎02-22-2022

While looking through the unsupported Spark features in CDP Private Cloud, I found the following statement:

"Using the JDBC Datasource API to access Hive or Impala is not supported"

Can you please explain this? If the JDBC data source is not supported, does that mean we can't access Hive from Spark?

Is there any fix or workaround for this?

Appreciate your quick response.

VidyaSargur · ‎02-23-2022

@Rajeshhadoop, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.

Regards,

Vidya Sargur,
Community Manager


mszurap · ‎02-24-2022

One more item to complete the picture.

SparkSQL does not directly support Hive ACID tables. To work with those in CDP, you can use the Hive Warehouse Connector (HWC). Please see:

https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewareh...
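To give a rough idea of what using HWC looks like, here is a minimal Scala sketch. It assumes the HWC jar is on the Spark classpath and that the HWC connection settings have already been configured as described in the documentation linked above. The table `acid_db.events` and its columns are hypothetical placeholders, and the exact API may differ between HWC versions, so please treat this as an illustration only.

```scala
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

object HwcAcidRead {
  // Hypothetical ACID table and columns, used only for illustration.
  val query: String = "SELECT id, status FROM acid_db.events"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hwc-acid-read").getOrCreate()

    // Build an HWC session on top of the existing SparkSession.
    val hive = HiveWarehouseSession.session(spark).build()

    // The query goes through HWC and comes back as a Spark DataFrame.
    val df = hive.executeQuery(query)
    df.show(10)

    spark.stop()
  }
}
```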


mszurap · ‎02-23-2022

Hi @Rajeshhadoop ,

The Spark DataSource API has the following syntax:

val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:...")...load()

Please see:

https://spark.apache.org/docs/2.4.0/sql-data-sources-jdbc.html

The problem with using this approach against Hive or Impala is that the read may run on multiple executors, each of which opens its own connection, and this could overwhelm and essentially DDoS the Hive / Impala service. As the documentation states, this is not a supported way of connecting from Spark to Hive/Impala.

However, you should still be able to connect to Hive and Impala through a plain JDBC connection using "java.sql.DriverManager" or "java.sql.Connection". In contrast, that runs on a single thread on the Spark driver side and creates a single connection to a HiveServer2 / Impala daemon instance. The throughput between the Spark driver and Hive/Impala is of course limited with this approach, so please use it only for simple queries or for submitting DDL/DML statements. Please see

https://www.cloudera.com/downloads/connectors/hive/jdbc.html

https://www.cloudera.com/downloads/connectors/impala/jdbc.html

for the JDBC drivers and for examples.
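As a rough illustration of the driver-side approach, a minimal Scala sketch could look like the following. It assumes a Hive JDBC driver is on the driver's classpath and that a URL of the form `jdbc:hive2://host:port/database` is accepted by it. The helper names `hiveUrl` and `firstColumn`, as well as the host, port and database, are hypothetical, and any authentication settings (for example Kerberos or SSL parameters) would need to be added to the URL according to the driver documentation.

```scala
import java.sql.{Connection, DriverManager}
import scala.collection.mutable.ListBuffer

object DriverSideJdbc {
  // Hypothetical helper: builds a HiveServer2-style JDBC URL from placeholder values.
  def hiveUrl(host: String, port: Int, database: String): String =
    s"jdbc:hive2://$host:$port/$database"

  // Opens ONE connection on the calling thread, runs one query,
  // and returns the first column of every row as a string.
  def firstColumn(url: String, sql: String): List[String] = {
    val conn: Connection = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try {
        val rs = stmt.executeQuery(sql)
        val out = ListBuffer.empty[String]
        while (rs.next()) out += rs.getString(1)
        out.toList
      } finally stmt.close() // closing the statement also closes its result set
    } finally conn.close()
  }
}
```

Called from the Spark driver code, for example as `DriverSideJdbc.firstColumn(DriverSideJdbc.hiveUrl("hs2.example.com", 10000, "default"), "SHOW TABLES")`, this opens a single connection and does not involve the executors at all.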

Independently of the above, you can still access Hive tables' data through SparkSQL with

val df = spark.sql("select ... from ...")

which is the recommended way of accessing and manipulating Hive table data from Spark as it is parallelized through the Spark executors. See docs:

https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html
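A slightly fuller Scala sketch of this SparkSQL approach is shown here. It assumes Spark has been set up with access to the Hive metastore, and that the table is a regular (non-ACID) Hive table. The table `sales_db.orders` and its columns are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object HiveTableRead {
  // Hypothetical table and columns, used only for illustration.
  val query: String =
    "SELECT customer_id, amount FROM sales_db.orders WHERE amount > 100"

  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark resolve table names through the Hive metastore.
    val spark = SparkSession.builder()
      .appName("hive-table-read")
      .enableHiveSupport()
      .getOrCreate()

    // The table data is read in parallel by the Spark executors,
    // not pulled through a JDBC connection.
    val df = spark.sql(query)
    df.groupBy("customer_id").count().show(10)

    spark.stop()
  }
}
```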

I hope this clarifies it.

Best regards

Miklos
