Hi all,
I'm new to the community. I have an issue with the interaction of Hive and Spark.
My environment is: HDP 3.1.5.0-152 with Hive 3.1.0 and Spark 2.3.2
What happens: when I run two different queries through hiveSession and the result datasets have the same schema types, Spark does not recognize the datasets as different.
Let me explain with an example.
I have two different datasets with the same column types (but not the same column names):
val ds1 = hiveSession.executeQuery(s"""
SELECT year,
month,
description,
1 as count
from example_table_1
where year = '2000'
""")
val ds2 = hiveSession.executeQuery(s"""
SELECT year,
month,
des, --different column name
count(1) as count --same type schema
from example_table_2 --different source table
where year = '2022' --different condition
group by year, month, des --group by (Hive requires every non-aggregated column here)
""")
ds1.cache
ds2.cache
If I cache both datasets, Spark returns this warning when I cache the second one:
22/06/29 07:35:30 WARN CacheManager: Asked to cache already cached data.
Calling show on the datasets gives the same evidence: the second dataset appears identical to the first one.
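For anyone who wants to reproduce this, here is a minimal sketch of a more direct check. It assumes a spark-shell session where the two datasets from the example above already exist. The helper name looksIdenticalToSpark is hypothetical. As far as I understand, Spark's CacheManager looks up cached data by comparing plans with sameResult, so this should show whether Spark treats the two plans as equivalent.

```scala
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical helper (not part of Spark or the Hive connector):
// compares the analyzed logical plans the way the cache lookup is
// assumed to compare them.
def looksIdenticalToSpark(a: Dataset[Row], b: Dataset[Row]): Boolean =
  a.queryExecution.analyzed.sameResult(b.queryExecution.analyzed)

// ds1 and ds2 are the two datasets from the example above.
println(looksIdenticalToSpark(ds1, ds2))

// Print the parsed, analyzed, optimized and physical plans for comparison.
ds1.explain(true)
ds2.explain(true)
```

If the helper returns true for two queries that are clearly different, that would be consistent with the "already cached" warning.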
Note: the anomaly does not occur when the hiveSession query is followed by a Spark transformation.
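A minimal sketch of what I mean by following the query with a Spark transformation is shown here. The choice of transformation (re-selecting every column) and the name ds2Projected are only illustrative assumptions; the sketch relies on ds2 from the example above.

```scala
import org.apache.spark.sql.functions.col

// Re-select every column so that a Spark-side projection sits on top
// of the Hive scan in the logical plan (illustrative choice only;
// other transformations may behave differently).
val ds2Projected = ds2.select(ds2.columns.map(col): _*)

// Cache the transformed dataset instead of the raw query result.
ds2Projected.cache()
ds2Projected.show()
```

Note that applying exactly the same transformation to both datasets might still leave Spark unable to tell them apart, so this is a workaround to experiment with rather than a fix.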
Have you also encountered this anomaly?