Created 02-22-2022 10:48 PM
While looking into the Spark unsupported features in CDP Private Cloud, I can see the below statement:
"Using the JDBC Datasource API to access Hive or Impala is not supported"
Can you please explain this? If the JDBC data source is not supported, can't we access Hive at all? Is this true?
Is there any fix for this?
Appreciate your quick response.
Created 02-23-2022 01:01 AM
Hi @Rajeshhadoop ,
The Spark DataSource API has the following syntax:
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:...")...load()
Please see:
https://spark.apache.org/docs/2.4.0/sql-data-sources-jdbc.html
The problem with using this approach against Hive or Impala is that, since the above may run on multiple executors in parallel, it could overwhelm, and essentially DDoS, the Hive/Impala service. As the documentation states, this is not a supported way of connecting from Spark to Hive/Impala.
However, you should still be able to connect to Hive and Impala through a simple JDBC connection using "java.sql.DriverManager" or "java.sql.Connection". That, in contrast, runs on a single thread on the Spark driver side and creates a single connection to a HiveServer2 / Impala daemon instance. The throughput between the Spark driver and Hive/Impala is of course limited with this approach, so please use it only for simple queries or for submitting DDL/DML statements. Please see
https://www.cloudera.com/downloads/connectors/hive/jdbc.html
https://www.cloudera.com/downloads/connectors/impala/jdbc.html
for the JDBC drivers and for examples.
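As a rough illustration, a single-threaded driver-side JDBC connection could look like the sketch below. The host, port, database, and credentials are placeholders, and the example uses the Apache Hive driver class; the exact driver class name and URL format depend on which of the drivers linked above you use:

```scala
import java.sql.DriverManager

object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder URL: substitute your HiveServer2 host, port, and database.
    val url = "jdbc:hive2://hs2-host.example.com:10000/default"

    // Opens one connection from the driver JVM only; nothing runs on executors,
    // so this cannot overwhelm the HiveServer2 / Impala service.
    val conn = DriverManager.getConnection(url, "username", "password")
    try {
      val stmt = conn.createStatement()
      // Suitable for simple queries or DDL/DML statements.
      val rs = stmt.executeQuery("SHOW TABLES")
      while (rs.next()) {
        println(rs.getString(1))
      }
      rs.close()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```

Since all result rows flow through this single connection, avoid pulling large result sets this way; use it for metadata operations and small lookups.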
Independently of the above, you can still access Hive tables' data through SparkSQL with
val df = spark.sql("select ... from ...")
which is the recommended way of accessing and manipulating Hive table data from Spark, as it is parallelized across the Spark executors. See the docs:
https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html
I hope this clarifies it.
Best regards
Miklos
Created 02-23-2022 10:04 PM
@Rajeshhadoop, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur
Created 02-24-2022 06:13 AM
One more item to add to have a complete picture.
SparkSQL does not directly support Hive ACID tables. For that, in CDP you can use the Hive Warehouse Connector (HWC); please see:
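As a minimal sketch of reading an ACID table through HWC (the database and table names are placeholders, and it assumes a SparkSession `spark` already configured with the HWC jar and the HiveServer2 JDBC URL settings required by your CDP version):

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the existing, HWC-configured SparkSession.
val hive = HiveWarehouseSession.session(spark).build()

// Execute the query through HiveServer2 so ACID table semantics are honored,
// instead of Spark reading the table files directly.
val df = hive.executeQuery("SELECT * FROM acid_db.acid_table")
df.show()
```

The key difference from plain `spark.sql` is that the query is delegated to Hive, which understands ACID transaction state, rather than being planned against the raw table files.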