
Spark unsupported features in CDP

While looking into the unsupported Spark features in CDP Private Cloud, I came across the statement below:

"Using the JDBC Datasource API to access Hive or Impala is not supported"

Can you please explain this? If the JDBC data source is not supported, does that mean we cannot access Hive from Spark? Is this true?

Is there a fix or workaround for this?

Appreciate your quick response.

 


3 REPLIES

Expert Contributor

Hi @Rajeshhadoop ,

The Spark DataSource API has the following syntax:

val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:...")...load()

Please see:

https://spark.apache.org/docs/2.4.0/sql-data-sources-jdbc.html 

The problem with using this approach against Hive or Impala is that the read may run on multiple executors at once, which can overwhelm (and essentially DDoS) the Hive / Impala service. As the documentation states, this is not a supported way of connecting from Spark to Hive/Impala.

 

However, you should still be able to connect to Hive and Impala through a plain JDBC connection using "java.sql.DriverManager" or "java.sql.Connection". In contrast to the above, that runs on a single thread on the Spark driver side and creates a single connection to a HiveServer2 / Impala daemon instance. The throughput between the Spark driver and Hive/Impala is of course limited with this approach, so please use it only for simple queries or for submitting DDL/DML statements. Please see

https://www.cloudera.com/downloads/connectors/hive/jdbc.html

https://www.cloudera.com/downloads/connectors/impala/jdbc.html 

for the JDBC drivers and for examples.
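As a rough sketch of the driver-side approach described above, the Scala snippet below opens one JDBC connection and runs one query on it. The names `DriverSideJdbc`, `hiveUrl` and `firstColumn`, as well as the host, port and database, are hypothetical placeholders. The `jdbc:hive2://` URL form is an assumption to verify against the driver documentation linked above, and authentication properties (for example Kerberos) are left out. The JDBC driver JAR has to be on the Spark driver's classpath.

```scala
import java.sql.{Connection, DriverManager}
import scala.collection.mutable.ListBuffer

object DriverSideJdbc {
  // Hypothetical helper: builds a HiveServer2 JDBC URL from placeholder values.
  // Assumption: the driver accepts the jdbc:hive2://host:port/database form.
  def hiveUrl(host: String, port: Int, database: String): String =
    s"jdbc:hive2://$host:$port/$database"

  // Runs one query over a single connection, on the calling (driver) thread,
  // and returns the first column of every row as a string.
  def firstColumn(url: String, sql: String): List[String] = {
    val conn: Connection = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try {
        val rs = stmt.executeQuery(sql)
        val out = ListBuffer.empty[String]
        while (rs.next()) out += rs.getString(1)
        out.toList
      } finally stmt.close()
    } finally conn.close()
  }
}
```

A call such as `DriverSideJdbc.firstColumn(DriverSideJdbc.hiveUrl("hs2.example.com", 10000, "default"), "SHOW TABLES")` would then run entirely on the driver. Because the whole result is collected into driver memory, this only suits small result sets.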

 

Independently of the above, you can still access Hive tables' data through SparkSQL with

val df = spark.sql("select ... from ...")

which is the recommended way of accessing and manipulating Hive table data from Spark as it is parallelized through the Spark executors. See docs:

https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html 
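A slightly fuller sketch of the SparkSQL route follows, assuming the Spark application can reach the Hive metastore (in spark-shell the `spark` session already exists and the builder part is not needed). The object name `HiveViaSparkSql` and the table `sales_db.customers` with its columns are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object HiveViaSparkSql {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets spark.sql resolve tables from the Hive metastore.
    val spark = SparkSession.builder()
      .appName("hive-via-sparksql")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder query: the table data is read in parallel by the Spark
    // executors rather than being pulled through a JDBC connection.
    val df = spark.sql("SELECT id, name FROM sales_db.customers WHERE id > 100")
    df.show(10)

    spark.stop()
  }
}
```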

I hope this clarifies it.

Best regards

 Miklos

Community Manager

@Rajeshhadoop, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager



Expert Contributor

One more item to add to have a complete picture.

SparkSQL does not directly support Hive ACID tables. To work with those in CDP, you can use the Hive Warehouse Connector (HWC); please see:

https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/integrating-hive-and-bi/topics/hive_hivewareh... 
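For orientation, a minimal sketch of reading an ACID table through HWC might look like the following. The class and method names (`HiveWarehouseSession.session(...).build()`, `executeQuery`) are given as I recall them from the HWC documentation and should be verified against the page linked above for your CDP version. The object name `AcidViaHwc` and the table `sales_db.orders_acid` are hypothetical, and the HWC JAR plus the connector's Spark configuration (such as the HiveServer2 JDBC URL) are assumed to be set up already.

```scala
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

object AcidViaHwc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("acid-via-hwc")
      .getOrCreate()

    // Assumption: HWC is on the classpath and configured via Spark properties.
    val hive = HiveWarehouseSession.session(spark).build()

    // Placeholder query against a hypothetical ACID table; the result is a
    // regular Spark DataFrame.
    val df = hive.executeQuery("SELECT * FROM sales_db.orders_acid LIMIT 10")
    df.show()

    spark.stop()
  }
}
```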
