Hi,
I am using the HiveWarehouseConnector on a HDP 3.0.1 cluster running 3 LLAP daemons.
For the same group-by + count query, I get inconsistent results between Hive and Spark, with Hive returning the correct counts.
The query executed on the Hive side (through executeQuery()) gives the expected result:
from pyspark_llap import HiveWarehouseSession

# Build the HiveWarehouseConnector session and run the aggregation in Hive/LLAP
hive = HiveWarehouseSession.session(spark).build()
analysis_types = hive.executeQuery(
    'select value, count(*) as nb '
    'from dev.table1 '
    'group by value')

# Display the resulting DataFrame in Zeppelin
z.show(analysis_types)
# Results
+------+------+
| value|    nb|
+------+------+
|value1| 45868|
|value2|  2924|
|value3|    40|
|value4|240317|
|value5| 45900|
+------+------+
The same aggregation executed on the Spark side (DataFrame API) gives an incomplete, incorrect result:
hive.table('dev.table1').groupBy('value').count().show()
# Result
+------+------+
| value| count|
+------+------+
|value1| 27362|
|value2|  1311|
|value3|    36|
|value4|189600|
|value5| 36252|
+------+------+
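The counts above differ from the Hive-side result for every value. As a sanity check, would pulling the raw column through executeQuery() and letting Spark aggregate go through the same read path as hive.table()? This is just a sketch, reusing the anonymized names from the first query:

# Sketch: scan via executeQuery(), aggregate in Spark, to compare
# against the hive.table() read path (same anonymized names as above)
hive.executeQuery('select value from dev.table1') \
    .groupBy('value').count().show()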
It seems to me that Spark is using some kind of cached or approximate result. Is that the case? If so, how can I force Spark to fully execute the query?
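If caching is indeed the problem, would something like this minimal sketch (reusing the anonymized table name from above) be enough to force a full re-read, or does the hive.table() path need something else?

# Clear anything Spark may have cached, then re-run the same aggregation
spark.catalog.clearCache()
hive.table('dev.table1').groupBy('value').count().show()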
Thanks