About fgerardi

fgerardi · 06-29-2022

Hi all,

I'm new to the community. I have an issue with the interaction between Hive and Spark.

My environment is HDP 3.1.5.0-152 with Hive 3.1.0 and Spark 2.3.2.

What happens? When I execute two different queries with hiveSession and the result datasets have the same schema types, Spark does not recognize the datasets as different.

Here is an example. I have two different datasets with the same schema types (but not the same column names):

    val ds1 = hiveSession.executeQuery(s"""
      SELECT year, month, description, 1 as count
      from example_table_1
      where year = '2000'
    """)

    val ds2 = hiveSession.executeQuery(s"""
      SELECT year, month,
             des,               --different column name
             count(1) as count  --same type schema
      from example_table_2      --different source table
      where year = '2022'       --different condition
      group by year, month, des --group by
    """)

    ds1.cache
    ds2.cache

If I cache both datasets, Spark returns this warning when I cache the second one:

    22/06/29 07:35:30 WARN CacheManager: Asked to cache already cached data.

We see the same evidence when calling show on the datasets: the second dataset appears identical to the first one.

Note: the anomaly does not occur when the hiveSession query is followed by a Spark transformation.

Has anyone else encountered this anomaly?
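One way to narrow this down, as a minimal sketch only: the warning suggests that Spark's cache lookup considers the plan of the second dataset equivalent to the plan of the first. The helper below compares the analyzed plans of two datasets directly. The name `plansLookIdentical` is hypothetical (it is not part of Spark or the Hive Warehouse Connector), and the sketch assumes that `hiveSession.executeQuery` returns a regular Spark DataFrame, so that `ds1` and `ds2` from the example can be passed in.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical diagnostic helper, not part of Spark or the Hive Warehouse Connector.
// Returns true when Spark's plan comparison treats the two datasets as
// producing the same result. Assumption: this mirrors the kind of check
// the cache lookup performs before logging the warning.
def plansLookIdentical(a: DataFrame, b: DataFrame): Boolean =
  a.queryExecution.analyzed.sameResult(b.queryExecution.analyzed)

// Usage with the ds1 and ds2 from the example (requires a live session):
// println(plansLookIdentical(ds1, ds2))
// println(ds1.queryExecution.analyzed.treeString)
// println(ds2.queryExecution.analyzed.treeString)
```

If the helper returns true for two queries that are clearly different, printing both plan trees side by side may show which part of the plan carries too little information to tell the queries apart.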

Last Visited	‎07-04-2022 12:16 PM

Member Since	‎06-29-2022 12:13 AM
Posts	1

Cloudera Community

Hive - Spark anomaly: Spark recognize different da...