Member since 03-18-2025
2 Posts
0 Kudos Received
0 Solutions
03-20-2025 05:20 AM
@Boris G, I literally started my thread by explaining why I need Impala. Problem solved, by the way.
03-18-2025 12:46 PM
Due to data masking, I can't read tables directly using 'vanilla' Spark. The workaround is to connect Spark to Impala via JDBC. The problem: when I use reserved words or certain operations such as `+ INTERVAL 1 DAY`, Impala returns the column names as values in the DataFrame.

This is how I start the Spark session:

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .config("spark.jars", "/home/cdsw/ImpalaJDBC42.jar")
    .getOrCreate()
)

And this is how I query data:

(
    spark
    .read
    .format("jdbc")
    .option("driver", "com.cloudera.impala.jdbc.Driver")
    .option("url", "jdbc:impala://MY_IMPALA_HOST:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1")
    .option("PWD", "MY_PASSWORD")
    .option("UID", "MY_USERNAME")
    .option("query", "SELECT 'a' AS index FROM MY_TABLE")
    .load()
    .show()
)

This is what I get:

+-----+
|index|
+-----+
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
|index|
+-----+

Other errors derive from this one. For example, running the query

SELECT current_date() + interval 1 day FROM MY_TABLE

raises the exception:

java.sql.SQLDataException: [Cloudera][JDBC](10140) Error converting value to Date.

This happens because Spark expects a value it can parse as a date, but Impala returns the column name as the value. Casting to string shows what is actually returned:

SELECT CAST(current_date() + interval 1 day AS STRING) FROM MY_TABLE

+-----------------------------------------------+
|cast(current_date() + interval 1 day as string)|
+-----------------------------------------------+
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
| cast(current_date...|
+-----------------------------------------------+

Can someone help me? I searched for a while and found some people who faced this issue some years ago. Is there a solution yet?
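
One likely explanation for the output above, offered as an assumption rather than a confirmed diagnosis: Spark's generic JDBC dialect wraps each column name in double quotes when it builds the outer SELECT around the `query` option, and Impala reads a double-quoted token as a string literal. Every row would then come back as the literal column name. The sketch below only builds SQL text to illustrate that; the helper names and the subquery alias are hypothetical, and the exact statement Spark generates may differ between versions.

```python
def quote_double(name: str) -> str:
    """Quote an identifier with double quotes (assumed generic-dialect behavior)."""
    return '"' + name + '"'


def quote_backtick(name: str) -> str:
    """Quote an identifier with backticks, which Impala treats as identifier quoting."""
    return "`" + name + "`"


def build_outer_select(columns, query, quote, alias="SPARK_GEN_SUBQ_0"):
    """Approximate the SELECT that wraps the user query as a subquery.

    The alias is a placeholder for illustration, not necessarily what Spark emits.
    """
    column_list = ", ".join(quote(c) for c in columns)
    return f"SELECT {column_list} FROM ({query}) {alias}"


user_query = "SELECT 'a' AS index FROM MY_TABLE"

# Double quotes: Impala would evaluate "index" as a string constant per row.
with_double = build_outer_select(["index"], user_query, quote_double)

# Backticks: Impala would resolve `index` as the column from the subquery.
with_backtick = build_outer_select(["index"], user_query, quote_backtick)

print(with_double)
print(with_backtick)
```

If this is the cause, it would also explain the `CAST(...)` case: the whole expression text is the column name, so it is sent back quoted and returned as a string. A JDBC dialect that quotes identifiers with backticks would be the direction to look in; that has to be registered on the JVM side, so it is not sketched here.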
Labels:
- Apache Impala
- Apache Spark