Support Questions

Find answers, ask questions, and share your expertise

Issue when using PySpark with Impala via JDBC

avatar
New Contributor

Due to data masking, I can't read tables directly using 'vanilla' Spark. The workaround is connecting Spark to Impala via JDBC and the problem is: when I use reserved words or some operations like `+ INTERVAL 1 DAY` Impala returns the column names as values in the DataFrame.

That's how I start the Spark session:

spark = (
    SparkSession
    .builder
    .config("spark.jars", "/home/cdsw/ImpalaJDBC42.jar")
    .getOrCreate()
)

and how I query data:

(
    spark
    .read
    .format("jdbc")
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("url", "jdbc:impala://MY_IMPALA_HOST:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1")
    .option("PWD", "MY_PASSWORD")
    .option("UID", "MY_USERNAME")
    .option("query", "SELECT 'a' AS index FROM MY_TABLE")
    .load()
    .show()
)

That's what I get:

+-----+
|index|
+-----+
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
+-----+

Other errors are derived from this one. For example, when running the query:

SELECT current_date() + interval 1 day FROM MY_TABLE

raises the exception:

java.sql.SQLDataException: [Cloudera][JDBC](10140) Error converting value to Date.

This happens because Spark is expecting a date to be parsed but Impala returns the column name as a value. We can see the returned value by casting to string:

SELECT CAST(current_date() + interval 1 day AS STRING) FROM MY_TABLE
+-----------------------------------------------+
|cast(current_date() + interval 1 day as string)|
+-----------------------------------------------+
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
|                           cast(current_date...|
+-----------------------------------------------+

Can someone help me? I searched for a while and found some people facing this issue some years ago. Is there a solution already?

1 ACCEPTED SOLUTION

avatar
Expert Contributor

Hello-

 

Pasting here the reply from 6 yrs ago, which I still find relevant:

Running Impala query over driver from Spark is not currently supported by Cloudera. Why don't you just use SparkSQL instead? Why need to have extra layer of impala here?

View solution in original post

3 REPLIES 3

avatar
Community Manager

@leoeiji Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Impala experts @jAnshula @Saurabhatiyal  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Expert Contributor

Hello-

 

Pasting here the reply from 6 yrs ago, which I still find relevant:

Running Impala query over driver from Spark is not currently supported by Cloudera. Why don't you just use SparkSQL instead? Why need to have extra layer of impala here?

avatar
New Contributor

@Boris G, I literaly started my thread explaining why I need Impala. Problem solved by the way.