Support Questions

leoeiji · ‎03-18-2025

Due to data masking, I can't read tables directly using 'vanilla' Spark. The workaround is connecting Spark to Impala via JDBC and the problem is: when I use reserved words or some operations like `+ INTERVAL 1 DAY` Impala returns the column names as values in the DataFrame.

That's how I start the Spark session:

spark = (
    SparkSession
    .builder
    .config("spark.jars", "/home/cdsw/ImpalaJDBC42.jar")
    .getOrCreate()
)

and how I query data:

(
    spark
    .read
    .format("jdbc")
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("url", "jdbc:impala://MY_IMPALA_HOST:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1")
    .option("PWD", "MY_PASSWORD")
    .option("UID", "MY_USERNAME")
    .option("query", "SELECT 'a' AS index FROM MY_TABLE")
    .load()
    .show()
)

That's what I get:

+-----+
|index|
+-----+
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
+-----+

Other errors are derived from this one. For example, when running the query:

SELECT current_date() + interval 1 day FROM MY_TABLE

raises the exception:

java.sql.SQLDataException: [Cloudera][JDBC](10140) Error converting value to Date.

This happens because Spark is expecting a date to be parsed but Impala returns the column name as a value. We can see the returned value by casting to string:

SELECT CAST(current_date() + interval 1 day AS STRING) FROM MY_TABLE

+-----------------------------------------------+
|cast(current_date() + interval 1 day as string)|
+-----------------------------------------------+
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
+-----------------------------------------------+

Can someone help me? I searched for a while and found some people facing this issue some years ago. Is there a solution already?

Boris G · ‎03-19-2025

Hello-

Pasting here the reply from 6 yrs ago, which I still find relevant:

Running Impala query over driver from Spark is not currently supported by Cloudera. Why don't you just use SparkSQL instead? Why need to have extra layer of impala here?

View solution in original post

DianaTorres · ‎03-19-2025

@leoeiji Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Impala experts @jAnshula @Saurabhatiyal who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Regards,

Diana Torres,
Community Moderator

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Community Guidelines
How to use the forum

Boris G · ‎03-19-2025

Hello-

Pasting here the reply from 6 yrs ago, which I still find relevant:

Running Impala query over driver from Spark is not currently supported by Cloudera. Why don't you just use SparkSQL instead? Why need to have extra layer of impala here?

leoeiji · ‎03-20-2025

@Boris G, I literaly started my thread explaining why I need Impala. Problem solved by the way.

Cloudera Community

Support Questions

Issue when using PySpark with Impala via JDBC

Hbase filter query using pyspark

Using Teradata JDBC connector in NiFi

Strange permission issue using hive jdbc

SparkSQL jdbc Federation

Impala and unicode (displaying issue)

Impala JDBC is not in maven repo

pyspark using Spark 2.3

Issues using newer versions of python packages su...

Nifi Oracle JDBC Connection Issue

Create Hive table using pyspark: Mkdirs failed to...