Created 03-18-2025 12:46 PM
Due to data masking, I can't read tables directly using 'vanilla' Spark. The workaround is connecting Spark to Impala via JDBC and the problem is: when I use reserved words or some operations like `+ INTERVAL 1 DAY` Impala returns the column names as values in the DataFrame.
That's how I start the Spark session:
spark = (
SparkSession
.builder
.config("spark.jars", "/home/cdsw/ImpalaJDBC42.jar")
.getOrCreate()
)
and how I query data:
(
spark
.read
.format("jdbc")
.option("driver", "com.cloudera.impala.jdbc.Driver")
.option("url", "jdbc:impala://MY_IMPALA_HOST:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1")
.option("PWD", "MY_PASSWORD")
.option("UID", "MY_USERNAME")
.option("query", "SELECT 'a' AS index FROM MY_TABLE")
.load()
.show()
)
That's what I get:
+-----+
|index|
+-----+
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
+-----+
Other errors are derived from this one. For example, when running the query:
SELECT current_date() + interval 1 day FROM MY_TABLE
raises the exception:
java.sql.SQLDataException: [Cloudera][JDBC](10140) Error converting value to Date.
This happens because Spark is expecting a date to be parsed but Impala returns the column name as a value. We can see the returned value by casting to string:
SELECT CAST(current_date() + interval 1 day AS STRING) FROM MY_TABLE
+-----------------------------------------------+
|cast(current_date() + interval 1 day as string)|
+-----------------------------------------------+
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
+-----------------------------------------------+
Can someone help me? I searched for a while and found some people facing this issue some years ago. Is there a solution already?
Created 03-19-2025 10:13 AM
@leoeiji Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Impala experts @jAnshula @Saurabhatiyal who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created 03-19-2025 12:39 PM
Hello-
Pasting here the reply from 6 yrs ago, which I still find relevant:
Running Impala query over driver from Spark is not currently supported by Cloudera. Why don't you just use SparkSQL instead? Why need to have extra layer of impala here?
Created 03-20-2025 05:20 AM
@Boris G, I literaly started my thread explaining why I need Impala. Problem solved by the way.
Created 06-01-2025 03:23 AM
Hi @leoeiji Could you please confirm on how did you resolve this issue, I am also facing the same problem.