Support Questions
Find answers, ask questions, and share your expertise

Getting nulls while querying complex types in PySpark


Hi,

 

PySpark behaves inconsistently when reading an array of structs. Querying the same table returns data in Impala, but PySpark SQL returns null for those fields.

 

Example:

Table: table_with_complex_type
Data type of field arrayofstruct: array of structs whose fields are all bigint.
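For context, the layout described above corresponds to roughly the following DDL (a hypothetical reconstruction; the struct field names are inferred from the queries below, not taken from the actual table definition):

CREATE TABLE table_with_complex_type (
  id BIGINT,
  arrayofstruct ARRAY<STRUCT<
    id: BIGINT,
    itemlabelhash: BIGINT,
    itemid: BIGINT
  >>
)
STORED AS PARQUET;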

The following Impala query runs fine and returns itemid for each record.

Impala Query: select s.id,s.itemlabelhash, s.itemid from table_with_complex_type t, t.arrayofstruct AS s where id=114139563345066193
+--------------------+---------------+------------+
| id | itemlabelhash | itemid |
+--------------------+---------------+------------+
| 114139563345066193 | NULL | 3209141558 |
| 114139563345066193 | NULL | 1011478495 |
| 114139563345066193 | NULL | 3131036211 |
| 114139563345066193 | NULL | 1678301274 |
| 114139563345066193 | NULL | 1907482443 |
| 114139563345066193 | NULL | 1942559899 |
| 114139563345066193 | NULL | 2167129407 |
+--------------------+---------------+------------+

However, a similar query in PySpark returns null for itemid:

>>> sqlContext.sql("SELECT s.id,s.itemlabelhash, s.itemid from table_with_complex_type t lateral VIEW explode(t.arrayofstruct) tab AS s where id=114139563345066193").show()
+------------------+-------------+------+
| id |itemlabelhash|itemid|
+------------------+-------------+------+
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
|114139563345066193| null| null|
+------------------+-------------+------+
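To confirm the field names PySpark expects for the struct, the table schema can be printed from the same session (a quick diagnostic sketch, assuming the table is registered in the metastore):

>>> sqlContext.table("table_with_complex_type").printSchema()

If the struct fields listed there match the query but the values still come back null, the mismatch is likely at the level of individual Parquet files rather than the table definition.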


Let me know if you need more details. 

 

Has anyone faced a similar issue? What's the resolution?


Re: Getting nulls while querying complex types in PySpark


I figured out the root cause of the issue.

 

Impala resolves columns in Parquet files by position (index), while PySpark resolves them by name (schema).
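As a side note, newer Impala versions can be switched to name-based resolution with a query option (assuming Impala 2.6 or later; verify against your version's documentation):

set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;

That makes Impala's lookup comparable to PySpark's name-based behavior, which can help when reproducing this kind of mismatch.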

 

Some of the Parquet files for the table had a slightly different schema (a column name was different), and as a result, PySpark couldn't find the field by name and returned null for it.
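A minimal sketch for tracking down which files carry the divergent column name (the paths below are hypothetical; list the table's actual files with hdfs dfs -ls first):

>>> # hypothetical paths under the table's warehouse directory
>>> paths = ["hdfs:///warehouse/table_with_complex_type/part-00000.parquet",
...          "hdfs:///warehouse/table_with_complex_type/part-00001.parquet"]
>>> for p in paths:
...     print(p)
...     sqlContext.read.parquet(p).printSchema()

Any file whose struct field names differ from the table schema is one PySpark will fill with nulls.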