06-29-2022 12:58 AM
Hi all, I'm new to the community. I have an issue with the interaction between Hive and Spark.
My environment is: HDP 3.1.5.0-152 with Hive 3.1.0 and Spark 2.3.2.
What happens? When I execute two different queries with hiveSession and the result datasets have the same schema types, Spark does not recognize the datasets as different. Here is an example. I have two different datasets with the same column types (but not the same column names):
val ds1 = hiveSession.executeQuery(s"""
SELECT year,
month,
description,
1 as count
from example_table_1
where year = '2000'
""")
val ds2 = hiveSession.executeQuery(s"""
SELECT year,
month,
des, --different column name
count(1) as count --same type schema
from example_table_2 --different source table
where year = '2022' --different condition
group by year, month, des --group by
""")
ds1.cache
ds2.cache
If I cache both datasets, Spark returns this warning when I cache the second one:
22/06/29 07:35:30 WARN CacheManager: Asked to cache already cached data.
We see the same evidence when we show the datasets: the second dataset appears identical to the first one.
Note: the anomaly does not occur when the hiveSession query is followed by a Spark transformation.
Have you also encountered this anomaly?
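One way to check whether Spark considers two results to be the same cached data is to compare their analyzed logical plans with `sameResult`, which, as far as I know, is the comparison Spark's CacheManager relies on when it looks up cached data. The following is only a minimal sketch under that assumption: the object name `CachePlanCheck`, the helper `plansLookIdentical`, and the in-memory sample rows are hypothetical stand-ins for the two hiveSession results, not part of the original job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CachePlanCheck {
  // Compares the analyzed logical plans of two DataFrames with sameResult.
  // Assumption: this mirrors how the CacheManager decides that data is "already cached".
  def plansLookIdentical(a: DataFrame, b: DataFrame): Boolean =
    a.queryExecution.analyzed.sameResult(b.queryExecution.analyzed)

  def main(args: Array[String]): Unit = {
    // Local session used only for illustration; on the cluster the existing session would be used.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-plan-check")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory stand-ins for the two query results (same column types, different data).
    val ds1 = Seq(("2000", "01", "a", 1L)).toDF("year", "month", "description", "count")
    val ds2 = Seq(("2022", "06", "b", 5L)).toDF("year", "month", "des", "count")

    println(s"plans look identical: ${plansLookIdentical(ds1, ds2)}")
    spark.stop()
  }
}
```

Calling `plansLookIdentical(ds1, ds2)` on the two hiveSession results before `ds2.cache` would show whether their plans collide, which would be consistent with the "Asked to cache already cached data" warning.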
Labels:
- Apache Hive
- Apache Spark