Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Can't read Json properly in Spark


Hi All,

I am trying to read the following valid JSON file through Spark SQL.

{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}

My code is as below:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.show()                                                                  
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
>>> df.createOrReplaceTempView("employees")
>>> sqlDF = spark.sql("SELECT * FROM employees")
>>> sqlDF.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
>>> 

As per my understanding, there should be only two columns, firstName and lastName. Is my understanding wrong?

Why is _corrupt_record appearing, and how can I avoid it?

Thanks and Regards,

Soumya

1 ACCEPTED SOLUTION

Super Guru

@soumyabrata kole

This is often a problem with multiline JSON documents: by default Spark expects one JSON object per line, so it reads the extra wrapper lines as corrupt records.

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

If you create a JSON file with the whole JSON document on a single line, Spark will be able to infer the schema correctly.

[spark@rkk1 ~]$ cat sample.json
{"employees":[{"firstName":"John", "lastName":"Doe"},{"firstName":"Anna", "lastName":"Smith"},{"firstName":"Peter", "lastName":"Jones"}]}

scala> val dfs = spark.sqlContext.read.json("file:///home/spark/sample.json")
dfs: org.apache.spark.sql.DataFrame = [employees: array<struct<firstName:string,lastName:string>>]

scala> dfs.printSchema
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)




4 REPLIES



Thanks. I found the explanation at the link below:

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
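The format the docs describe is often called JSON Lines: each line is a complete, self-contained JSON object. A small stdlib-only sketch (file name is illustrative) of what such a file looks like:

```python
import json
import os
import tempfile

lines = [
    '{"firstName":"John", "lastName":"Doe"}',
    '{"firstName":"Anna", "lastName":"Smith"}',
    '{"firstName":"Peter", "lastName":"Jones"}',
]

# Every line parses on its own as a valid JSON object:
records = [json.loads(line) for line in lines]
print([r["firstName"] for r in records])  # ['John', 'Anna', 'Peter']

# Written this way, the default spark.read.json() reader infers
# just the firstName and lastName columns.
path = os.path.join(tempfile.mkdtemp(), "employees.jsonl")
with open(path, "w") as f:
    f.write("\n".join(lines))
```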

Super Guru

@soumyabrata kole Hope the answer in this thread helps you. Please upvote/accept the best answer so that others can get help from it.


How can I dump the corrupted records to some location for future reference?