
Issue when using PySpark with Impala via JDBC

leoeiji — Tue, 18 Mar 2025 19:46:17 GMT

Because of data masking, I can't read tables directly with 'vanilla' Spark. As a workaround, I connect Spark to Impala via JDBC. The problem is that when I use reserved words or certain operations such as `+ INTERVAL 1 DAY`, the DataFrame contains the column names as values instead of the actual data.

This is how I start the Spark session:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "/home/cdsw/ImpalaJDBC42.jar")
    .getOrCreate()
)

and how I query data:

(
    spark.read
    .format("jdbc")
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("url", "jdbc:impala://MY_IMPALA_HOST:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1")
    .option("PWD", "MY_PASSWORD")
    .option("UID", "MY_USERNAME")
    .option("query", "SELECT 'a' AS index FROM MY_TABLE")
    .load()
    .show()
)

That's what I get:

Other errors stem from this one. For example, running the query:

SELECT current_date() + interval 1 day FROM MY_TABLE

raises the exception:

java.sql.SQLDataException: [Cloudera][JDBC](10140) Error converting value to Date.

This happens because Spark expects a value it can parse as a date, but Impala returns the column name as the value. Casting to a string shows what is actually returned:

SELECT CAST(current_date() + interval 1 day AS STRING) FROM MY_TABLE
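One possible explanation, offered as an assumption rather than something confirmed in this thread: Spark's JDBC reader wraps the user query in an outer SELECT and quotes each column name with double quotes, while Impala treats a double-quoted token as a string literal. The sketch below only imitates that wrapping in plain Python. `outer_select` and the alias `spark_gen_alias` are hypothetical names for illustration, not Spark APIs.

```python
def outer_select(columns, subquery, quote='"'):
    """Imitate the outer SELECT a JDBC reader might wrap around a user query.

    Assumption: column names are quoted with `quote` and the user query
    becomes an aliased subquery. The alias name is made up for this sketch.
    """
    cols = ",".join(f"{quote}{c}{quote}" for c in columns)
    return f"SELECT {cols} FROM ({subquery}) spark_gen_alias"


user_query = "SELECT 'a' AS index FROM MY_TABLE"

# Double quotes: Impala would read "index" as a string literal, not a column.
print(outer_select(["index"], user_query))

# Backticks: Impala would read `index` as a column identifier.
print(outer_select(["index"], user_query, quote="`"))
```

If this assumption holds, the first statement asks Impala for the literal string on every row, which would match the symptom, and the same literal would fail to convert when Spark expects a date. The backtick form would select the actual column.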

Can someone help me? I searched for a while and found people who faced this issue some years ago. Is there a solution yet?

Re: Issue when using PySpark with Impala via JDBC

DianaTorres — Wed, 19 Mar 2025 17:13:49 GMT

@leoeiji Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Impala experts @jAnshula @Saurabhatiyal who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Re: Issue when using PySpark with Impala via JDBC

Boris G — Wed, 19 Mar 2025 19:39:31 GMT

Hello-

Pasting here the reply from 6 yrs ago, which I still find relevant:

Running Impala queries over the JDBC driver from Spark is not currently supported by Cloudera. Why don't you just use Spark SQL instead? Why do you need the extra layer of Impala here?

Re: Issue when using PySpark with Impala via JDBC

leoeiji — Thu, 20 Mar 2025 12:20:54 GMT

@Boris G, I literally started my thread by explaining why I need Impala. Problem solved, by the way.

Re: Issue when using PySpark with Impala via JDBC

akb2025 — Sun, 01 Jun 2025 10:23:23 GMT

Hi @leoeiji, could you please share how you resolved this issue? I am facing the same problem.