
Getting nulls while querying complex types in PySpark

Hi,

PySpark returns unexpected results when reading an array of structs: querying the same table returns data in Impala, but PySpark SQL returns null for the struct fields.

Example:

Table: table_with_complex_type
Data type of field arrayofstruct: an array of structs in which every field is of type bigint.
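
For reference, the table definition looks roughly like the sketch below, reconstructed from the schema just described; the exact DDL and the storage format (Parquet here) are assumptions.

# Sketch of the table DDL, reconstructed from the schema described above.
# The storage format (Parquet) and exact layout are assumptions.
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS table_with_complex_type (
        arrayofstruct ARRAY<STRUCT<
            id: BIGINT,
            itemlabelhash: BIGINT,
            itemid: BIGINT
        >>
    )
    STORED AS PARQUET
""")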

The following Impala query runs fine and returns itemid for each record.

Impala query: SELECT s.id, s.itemlabelhash, s.itemid FROM table_with_complex_type t, t.arrayofstruct AS s WHERE id = 114139563345066193
+--------------------+---------------+------------+
| id                 | itemlabelhash | itemid     |
+--------------------+---------------+------------+
| 114139563345066193 | NULL          | 3209141558 |
| 114139563345066193 | NULL          | 1011478495 |
| 114139563345066193 | NULL          | 3131036211 |
| 114139563345066193 | NULL          | 1678301274 |
| 114139563345066193 | NULL          | 1907482443 |
| 114139563345066193 | NULL          | 1942559899 |
| 114139563345066193 | NULL          | 2167129407 |
+--------------------+---------------+------------+

However, the equivalent query in PySpark returns null for itemid:

>>> sqlContext.sql("SELECT s.id, s.itemlabelhash, s.itemid FROM table_with_complex_type t LATERAL VIEW explode(t.arrayofstruct) tab AS s WHERE id = 114139563345066193").show()
+------------------+-------------+------+
|                id|itemlabelhash|itemid|
+------------------+-------------+------+
|114139563345066193|         null|  null|
|114139563345066193|         null|  null|
|114139563345066193|         null|  null|
|114139563345066193|         null|  null|
|114139563345066193|         null|  null|
|114139563345066193|         null|  null|
|114139563345066193|         null|  null|
+------------------+-------------+------+
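
If it helps narrow this down, the sketch below compares the schema Spark resolves from the metastore with the schema embedded in the underlying data files, and runs the same explode through the DataFrame API; the Parquet format and the warehouse path are assumptions to be replaced with the table's actual storage details. If the two printed schemas disagree (for example, in field-name casing), that mismatch could explain the nulls.

# Diagnostic sketch: compare the schema Spark resolves from the metastore
# with the schema written in the data files. The Parquet format and the
# warehouse path are assumptions; substitute the table's real location.
from pyspark.sql.functions import explode, col

df = sqlContext.table("table_with_complex_type")
df.printSchema()  # schema as resolved from the Hive metastore

raw = sqlContext.read.parquet("/user/hive/warehouse/table_with_complex_type")
raw.printSchema()  # schema embedded in the files themselves

# The same explode expressed through the DataFrame API, to rule out a
# SQL-parsing quirk; it would show the same nulls if the schemas disagree.
df.select(explode(col("arrayofstruct")).alias("s")) \
  .select("s.id", "s.itemlabelhash", "s.itemid") \
  .where(col("s.id") == 114139563345066193) \
  .show()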


Let me know if you need more details.

Has anyone faced a similar issue? What's the resolution?
