Hi all, I'm new to the community. I have an issue with the interaction between Hive and Spark. My environment is HDP 220.127.116.11-152 with Hive 3.1.0 and Spark 2.3.2.
What happens? When I execute two different queries with hiveSession and the result datasets have the same schema types, Spark does not recognize the datasets as different.
Let me explain with an example.
I have two different datasets with the same schema types (but not the same column names):
val ds1 = hiveSession.executeQuery(s"""
1 as count
where year = '2000'
""")
val ds2 = hiveSession.executeQuery(s"""
des, --different column name
count(1) as count --same type schema
from example_table_2 --different source table
where year = '2022' --different condition
group by year --group by
""")
If I cache both datasets, Spark returns this warning when I cache the second one:
22/06/29 07:35:30 WARN CacheManager: Asked to cache already cached data.
We see the same thing when we show the datasets: the second dataset appears identical to the first one.
Note: the anomaly does not occur when the hiveSession query is followed by a Spark transformation.
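For anyone who wants to reproduce the check: as far as I understand, in Spark 2.3 the CacheManager logs that warning when the plan being cached matches an already cached plan via sameResult, so comparing the two analyzed plans directly should show whether Spark really treats them as equal. Below is a minimal sketch, assuming ds1 and ds2 are ordinary Datasets as returned above; comparePlans is a hypothetical helper name, not part of Spark or of the Hive connector.

```scala
import org.apache.spark.sql.Dataset

// Hypothetical diagnostic helper (not from the original post):
// prints both analyzed plans and whether Spark considers them equivalent.
def comparePlans(a: Dataset[_], b: Dataset[_]): Unit = {
  val planA = a.queryExecution.analyzed
  val planB = b.queryExecution.analyzed

  println(planA.treeString)
  println(planB.treeString)

  // sameResult compares the canonicalized plans; to my knowledge this is
  // the comparison the cache lookup relies on.
  println(s"sameResult: ${planA.sameResult(planB)}")
}

comparePlans(ds1, ds2)
```

If sameResult prints true for two queries that are clearly different, the two plans are indistinguishable to Spark after canonicalization, which would be consistent with the warning above.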
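Building on that note, one possible workaround is to put a Spark transformation between the query and the cache call. The sketch below is only an assumption on my side and has not been tested against this setup: it adds a literal tag column so that the two plans differ. tagAndCache and the query_tag column name are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.lit

// Hypothetical workaround helper (an assumption, not a confirmed fix):
// appends a literal column so each dataset gets a distinct logical plan
// before it is cached.
def tagAndCache(ds: Dataset[Row], tag: String): Dataset[Row] =
  ds.withColumn("query_tag", lit(tag)).cache()

val cached1 = tagAndCache(ds1, "ds1")
val cached2 = tagAndCache(ds2, "ds2")
```

The extra column changes the schema, so it would have to be dropped or ignored downstream; any other transformation that makes the two plans differ might serve equally well.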