Spark Warehouse connector gives incorrect results on Decimal data type.

Hi community. 

We have been seeing some incorrect behaviour in how the Spark Hive Warehouse Connector pulls data back from Hive when the Hive data type is defined as decimal.

We initially suspected the issue was related to decimal precision, but we have confirmed that it persists even for decimal types of (18,0) and lower; a sketch of that test follows.
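For reference, the precision test looked roughly like this, using the HWC session from the pyspark snippets further down (the test_decimal18 table name is ours for illustration; output omitted):

>>> # sketch: recreate the test data with the smaller decimal(18,0) type
>>> hive.executeUpdate('create table test_decimal18 (id string, value decimal(18,0))')
>>> hive.executeUpdate('insert into test_decimal18 select * from test_decimal')
>>> hive.executeQuery('select * from test_decimal18 where value in (2,3)').show()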

We note that Hive vectorization influences the behaviour: with vectorization disabled, we get correct results back from Hive (roughly as sketched below).
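For completeness, this is roughly how we toggled vectorization when testing (a sketch only; whether a session-level set is honoured end-to-end by the HWC/LLAP read path can vary, and the property can also be changed server-side in the Hive config):

>>> # sketch: disable Hive vectorized execution, then re-run the failing query
>>> hive.executeUpdate('set hive.vectorized.execution.enabled=false')
>>> hive.executeQuery('select * from test_decimal where value in (2,3)').show()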

 

Can anyone shed some light on this as-yet-undocumented bug in spark-hwx?

See below for a quick demonstration:
# define the table in Hive and query it

0: jdbc:hive2://bmhdp3-worker02.example.com> create table test_decimal (id string, value decimal(32,0));
0: jdbc:hive2://bmhdp3-worker02.example.com> insert into test_decimal values ('a1', 1), ('b1', 2),('c1', 3),('a2', 1),('b2', 2),('c2', 3),('a3', 1),('b3', 2),('c3', 3);

0: jdbc:hive2://bmhdp3-worker02.example.com> select * from test_decimal;

+------------------------+---------------------------+
| test_decimal.id        | test_decimal.value        |
+------------------------+---------------------------+
| a1                     | 1                         |
| b1                     | 2                         |
| c1                     | 3                         |
| a2                     | 1                         |
| b2                     | 2                         |
| c2                     | 3                         |
| a3                     | 1                         |
| b3                     | 2                         |
| c3                     | 3                         |
+------------------------+---------------------------+

0: jdbc:hive2://bmhdp3-worker02.example.com> select * from test_decimal where value in (2,3);

+------------------------+---------------------------+
| test_decimal.id        | test_decimal.value        |
+------------------------+---------------------------+
| b1                     | 2                         |
| c1                     | 3                         |
| b2                     | 2                         |
| c2                     | 3                         |
| b3                     | 2                         |
| c3                     | 3                         |
+------------------------+---------------------------+

 

 

Now, in the pyspark session, the same query returns incorrect results in the value field, although the correct id records come back.

 

>>> from pyspark_llap import HiveWarehouseSession
>>> hive = HiveWarehouseSession.session(spark).build()
>>> hive.executeQuery('select * from test_decimal where value in (2,3)').show()
20/02/18 17:00:43 WARN TaskSetManager: Stage 2 contains a task of very large size (457 KB). The maximum recommended task size is 100 KB.
+---+-----+                                                                     
| id|value|
+---+-----+
| b1|    1|
| c1|    2|
| b2|    3|
| c2|    1|
| b3|    2|
| c3|    3|
+---+-----+

 

 

Another example in pyspark: this time the query is on the id field, but again the value column is scrambled.

 

>>> hive.executeQuery('select * from test_decimal where id in ("b1", "b2")').show()
20/02/18 17:02:33 WARN TaskSetManager: Stage 3 contains a task of very large size (457 KB). The maximum recommended task size is 100 KB.
+---+-----+
| id|value|
+---+-----+
| b1|    1|
| b2|    2|
+---+-----+

 

 

Another example, with a simple WHERE clause involving no OR. Once again we get spurious values in the value column.

 

>>> hive.executeQuery('select * from test_decimal where value == 2 ').show()
20/02/18 18:56:12 WARN TaskSetManager: Stage 39 contains a task of very large size (459 KB). The maximum recommended task size is 100 KB.
+---+-----+
| id|value|
+---+-----+
| b1|    2|
| b2|    0|
| b3|    0|
+---+-----+


## showing a full table scan... no issues

>>> hive.executeQuery('select * from test_decimal').show()
20/02/18 18:56:35 WARN TaskSetManager: Stage 40 contains a task of very large size (453 KB). The maximum recommended task size is 100 KB.
+---+-----+
| id|value|
+---+-----+
| a1|    1|
| b1|    2|
| c1|    3|
| a2|    1|
| b2|    2|
| c2|    3|
| a3|    1|
| b3|    2|
| c3|    3|
+---+-----+

### showing schema

>>> hive.executeQuery('select * from test_decimal').printSchema()
root
 |-- id: string (nullable = true)
 |-- value: decimal(32,0) (nullable = true)

>>> 
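For anyone trying to reproduce: the same filtered read can also be expressed through the HWC DataFrame API (a sketch; it should exercise the same LLAP reader path, though we did not capture its output here):

>>> # sketch: same read via the HWC table() API; the filter is applied on the same reader
>>> hive.table('test_decimal').filter('value in (2,3)').show()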

 

 

Further information

Spark version

>>> sc.version
u'2.3.2.3.1.4.0-315'

 

 

Hive version

[bmdev@bmhdp3-edge01 ~]$ hive --version
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-315/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-315/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive 3.1.0.3.1.4.0-315
Git git://ctr-e139-1542663976389-113618-01-000003.hwx.site/grid/0/jenkins/workspace/HDP-parallel-centos7/SOURCES/hive -r e8d79f440455fa4400daf79974666b3055f1730f
Compiled by jenkins on Fri Aug 23 05:16:38 UTC 2019
From source with checksum 0321d07fd607c216351462c714d08b6a
[bmdev@bmhdp3-edge01 ~]$ 

 

HWC version

/usr/hdp/3.1.4.0-315/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar

 

 

 
