Created 12-10-2016 07:18 AM
Hi All,
I am trying to read a valid JSON document, shown below, through Spark SQL.
{"employees":[ {"firstName":"John", "lastName":"Doe"}, {"firstName":"Anna", "lastName":"Smith"}, {"firstName":"Peter", "lastName":"Jones"} ]}
My code is as below:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
>>> df.createOrReplaceTempView("employees")
>>> sqlDF = spark.sql("SELECT * FROM employees")
>>> sqlDF.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
As per my understanding, there should be only two columns: firstName and lastName. Is that understanding wrong?
Why is _corrupt_record appearing, and how can I avoid it?
Thanks and Regards,
Soumya
Created 12-10-2016 08:45 AM
This is a common problem with multiline JSON documents: Spark reads the file line by line, so a document that spans several lines shows up as a corrupt record.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
If you create the JSON file with the whole document on a single line, Spark will be able to infer the schema correctly.
[spark@rkk1 ~]$ cat sample.json
{"employees":[{"firstName":"John", "lastName":"Doe"},{"firstName":"Anna", "lastName":"Smith"},{"firstName":"Peter", "lastName":"Jones"}]}

scala> val dfs = spark.sqlContext.read.json("file:///home/spark/sample.json")
dfs: org.apache.spark.sql.DataFrame = [employees: array<struct<firstName:string,lastName:string>>]

scala> dfs.printSchema
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
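For the PySpark session from the question, the same single-line file gives one nested array column, which can be flattened with explode. A minimal sketch, assuming employees.json now holds the whole document on one line (the output shown is what the data in the question should produce):

>>> from pyspark.sql.functions import explode
>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.select(explode("employees").alias("e")).select("e.firstName", "e.lastName").show()
+---------+--------+
|firstName|lastName|
+---------+--------+
|     John|     Doe|
|     Anna|   Smith|
|    Peter|   Jones|
+---------+--------+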
Created 12-10-2016 09:55 AM
Thanks. From the link below, I found the explanation:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
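So an alternative fix is to reformat the file as one self-contained object per line (the JSON Lines convention). With the data from the question, the file and read would look like this (a sketch, reusing the same path as above):

$ cat employees.json
{"firstName":"John", "lastName":"Doe"}
{"firstName":"Anna", "lastName":"Smith"}
{"firstName":"Peter", "lastName":"Jones"}

>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.show()
+---------+--------+
|firstName|lastName|
+---------+--------+
|     John|     Doe|
|     Anna|   Smith|
|    Peter|   Jones|
+---------+--------+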
Created 12-17-2016 08:00 AM
@soumyabrata kole I hope the answers in this thread helped you. Please upvote/accept the best answer so that others can get help from it.
Created 11-28-2017 06:53 AM
How can I dump the corrupted records to some location for future reference?
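One possible approach (a sketch, with hypothetical paths): read in PERMISSIVE mode so bad lines land in the corrupt-record column, then filter those rows out and write them somewhere for later inspection. Note that newer Spark versions require caching the DataFrame before running a query that references only _corrupt_record:

>>> df = spark.read \
...     .option("mode", "PERMISSIVE") \
...     .option("columnNameOfCorruptRecord", "_corrupt_record") \
...     .json("/path/to/input.json")  # hypothetical input path
>>> df.cache()  # needed on newer Spark before filtering on _corrupt_record alone
>>> bad = df.filter(df["_corrupt_record"].isNotNull()).select("_corrupt_record")
>>> bad.write.mode("overwrite").text("/path/to/corrupt_records")  # hypothetical output path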