Hive - Spark anomaly: Spark recognizes different datasets as equal

New Contributor

Hi all,
I'm new to the community. I have an issue with the interaction of Hive and Spark.
My environment is: HDP 3.1.5.0-152 with Hive 3.1.0 and Spark 2.3.2

What happens? When I execute two different queries with hiveSession and the resulting datasets have the same schema types, Spark does not recognize the datasets as different.


I will explain this with an example:

I have two different datasets with the same schema types (but not the same column names):

val ds1 = hiveSession.executeQuery(s"""
        SELECT  year,
                month,
                description,
                1 as count
        from    example_table_1
        where   year = '2000'
    """)

val ds2 = hiveSession.executeQuery(s"""
        SELECT  year,
                month,
                des,                        -- different column name
                count(1) as count           -- same schema type
        from    example_table_2             -- different source table
        where   year = '2022'               -- different condition
        group by year, month, des           -- group by (all non-aggregated columns)
    """)

ds1.cache
ds2.cache

If I cache both datasets, Spark returns this warning when I cache the second one:

22/06/29 07:35:30 WARN CacheManager: Asked to cache already cached data.

We see the same thing when calling show on the datasets: the second dataset looks like the first one.
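
One way to narrow this down is to ask Spark directly how it sees the two datasets before the second one is cached. The following is only a minimal sketch, assuming ds1 and ds2 are the datasets defined above; the object and method names (`CacheDiagnostics`, `describe`) are hypothetical. It relies on the analyzed logical plan's `sameResult` check and on `Dataset.storageLevel`, which as far as I understand is what Spark's cache lookup is based on in this version.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark or the Hive connector.
object CacheDiagnostics {
  // Reports how Spark sees `second` relative to `first`.
  def describe(first: Dataset[_], second: Dataset[_]): String = {
    // True when the canonicalized analyzed plans compare equal.
    val samePlan =
      first.queryExecution.analyzed.sameResult(second.queryExecution.analyzed)
    // Anything other than NONE means Spark already finds a cache entry
    // that it considers a match for `second`.
    val alreadyCached = second.storageLevel != StorageLevel.NONE
    s"samePlan=$samePlan, secondAlreadyCached=$alreadyCached"
  }
}
```

For example, running `println(CacheDiagnostics.describe(ds1, ds2))` after `ds1.cache` and before `ds2.cache` would show whether Spark already treats ds2 as cached. Comparing the output of `ds1.explain(true)` and `ds2.explain(true)` may also help to see where the two plans stop being distinguishable.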

Note: the anomaly does not occur when the hiveSession query is followed by a Spark transformation.
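
As an illustration of that note, here is a minimal sketch of adding a transformation between the hiveSession query and the cache call. It assumes that an extra constant column is acceptable in the result; the helper name `tagged` and the column name `source_tag` are hypothetical, and whether this particular transformation avoids the anomaly in a given environment would still need to be verified.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lit

// Hypothetical helper, not part of Spark or the Hive connector.
object CacheWorkaround {
  // Appends a constant column so that the plan being cached is a projection
  // carrying a distinct literal rather than the bare query result.
  def tagged(ds: Dataset[_], tag: String): DataFrame =
    ds.withColumn("source_tag", lit(tag))
}
```

Usage would look like `CacheWorkaround.tagged(ds1, "example_table_1").cache` and `CacheWorkaround.tagged(ds2, "example_table_2").cache`, with a different tag for each query.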

Have you also encountered this anomaly?

1 REPLY

Master Collaborator

Hi @fgerardi, could you please share repro steps with a sample dataset so that we can test this behavior? Thanks.