HiveWarehouseConnector: getting incomplete results in Spark

Hi,

I am using the HiveWarehouseConnector on an HDP 3.0.1 cluster running 3 LLAP daemons.


For the same groupBy + count query, Hive and Spark return inconsistent results.

The query executed on the Hive side gives the correct result:

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
analysis_types = hive.executeQuery(
    'select value, count(*) as nb '
    'from dev.table1 '
    'group by value')
z.show(analysis_types)
# Results
+------+------+
| value|    nb|
+------+------+
|value1| 45868|
|value2|  2924|
|value3|    40|
|value4|240317|
|value5| 45900|
+------+------+

The same query executed on the Spark side (DataFrame) gives an incomplete, incorrect result:

hive.table('dev.trc_result_orc').groupBy('analysis_type').count().show()
# Result
+------+------+
| value|    nb|
+------+------+
|value1| 27362|
|value2|  1311|
|value3|    36|
|value4|189600|
|value5| 36252|
+------+------+
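
To quantify the gap between the two outputs, the per-group counts can be compared directly. The following is only a sketch in plain Python: the numbers are copied from the two result tables above, and the helper name `missing_rows` is hypothetical, not part of the HiveWarehouseConnector API.

```python
# Counts copied from the Hive-side and Spark-side outputs above.
hive_counts = {
    "value1": 45868,
    "value2": 2924,
    "value3": 40,
    "value4": 240317,
    "value5": 45900,
}
spark_counts = {
    "value1": 27362,
    "value2": 1311,
    "value3": 36,
    "value4": 189600,
    "value5": 36252,
}


def missing_rows(expected, actual):
    """Return {key: expected - actual} for every key whose counts differ.

    Hypothetical helper for illustration; a key absent from one side
    is treated as a count of 0.
    """
    keys = sorted(set(expected) | set(actual))
    return {
        k: expected.get(k, 0) - actual.get(k, 0)
        for k in keys
        if expected.get(k, 0) != actual.get(k, 0)
    }


diff = missing_rows(hive_counts, spark_counts)
total_missing = sum(diff.values())

for key, n in diff.items():
    print(f"{key}: {n} rows missing on the Spark side")
print(f"total missing: {total_missing}")
```

Every group is short on the Spark side, which suggests rows are being dropped across the board rather than for one particular value.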


It looks as though Spark is returning some kind of cached or approximate result. Is that the case? If so, how can I force Spark to execute the query fully?

Thanks

Re: HiveWarehouseConnector: getting incomplete results in Spark

What data type is the nb column?