
Getting nulls while querying complex types in PySpark

Hi,

 

PySpark behaves inconsistently when reading an array of structs. Querying the same table returns data in Impala, but PySpark SQL returns null for those fields.

 

Example:

Table: table_with_complex_type
Data type of field arrayofstruct: array of struct, with all struct fields of type bigint.

The following Impala query runs fine and returns itemid for each record.

Impala Query: select s.id,s.itemlabelhash, s.itemid from table_with_complex_type t, t.arrayofstruct AS s where id=114139563345066193
+--------------------+---------------+------------+
| id | itemlabelhash | itemid |
+--------------------+---------------+------------+
| 114139563345066193 | NULL | 3209141558 |
| 114139563345066193 | NULL | 1011478495 |
| 114139563345066193 | NULL | 3131036211 |
| 114139563345066193 | NULL | 1678301274 |
| 114139563345066193 | NULL | 1907482443 |
| 114139563345066193 | NULL | 1942559899 |
| 114139563345066193 | NULL | 2167129407 |
+--------------------+---------------+------------+

However, the equivalent query in PySpark returns null for itemid:

>>> sqlContext.sql("SELECT s.id,s.itemlabelhash, s.itemid from table_with_complex_type t lateral VIEW explode(t.arrayofstruct) tab AS s where id=114139563345066193").show()
+------------------+-------------+------+
| id |itemlabelhash|itemid|
+------------------+-------------+------+
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
+------------------+-------------+------+
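For reference, a quick way to double-check the schema Spark actually resolves for the nested column (a minimal sketch using the same table name; the DataFrame calls are standard PySpark API):

>>> df = sqlContext.table("table_with_complex_type")
>>> df.printSchema()  # shows the struct field names Spark will match by name
>>> from pyspark.sql.functions import col, explode
>>> exploded = df.where(col("id") == 114139563345066193) \
...               .select(col("id"), explode(col("arrayofstruct")).alias("s"))
>>> exploded.select("id", "s.itemlabelhash", "s.itemid").show()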


Let me know if you need more details. 

 

Has anyone faced a similar issue? What's the resolution?


Re: Getting nulls while querying complex types in PySpark

Figured out the root cause of the issue.

 

Impala resolves Parquet columns by position (index), while PySpark resolves them by name from the schema.

 

Some of the Parquet files for the table had a slightly different schema (a column name was different), and as a result, PySpark couldn't find that field and returned null for it.
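For anyone hitting the same thing, a rough way to confirm this kind of schema drift from the PySpark shell (the HDFS path below is an assumption; substitute the table's actual storage location):

>>> path = "/user/hive/warehouse/table_with_complex_type"  # hypothetical location
>>> # Schema taken from a single file / summary metadata:
>>> sqlContext.read.parquet(path).printSchema()
>>> # Union of the schemas of all files; a field listed under two different
>>> # names points at the files with the mismatched column name:
>>> sqlContext.read.option("mergeSchema", "true").parquet(path).printSchema()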

 

 
