Created 12-10-2016 07:18 AM
Hi All,
I am trying to read a valid JSON document, shown below, through Spark SQL.
{"employees":[ {"firstName":"John", "lastName":"Doe"}, {"firstName":"Anna", "lastName":"Smith"}, {"firstName":"Peter", "lastName":"Jones"} ]}
My code is as below:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
>>> df.createOrReplaceTempView("employees")
>>> sqlDF = spark.sql("SELECT * FROM employees")
>>> sqlDF.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
As per my understanding, there should be only two columns: firstName and lastName. Is that understanding wrong?
Why is _corrupt_record appearing, and how can I avoid it?
Thanks and Regards,
Soumya
Created 12-10-2016 08:45 AM
This is a common problem with multiline JSON documents: Spark reads the file line by line, so a document that spans several lines shows up as a corrupt record.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
If you create the JSON file with the whole document on a single line, Spark will be able to infer the schema correctly.
[spark@rkk1 ~]$ cat sample.json
{"employees":[{"firstName":"John", "lastName":"Doe"},{"firstName":"Anna", "lastName":"Smith"},{"firstName":"Peter", "lastName":"Jones"}]}

scala> val dfs = spark.sqlContext.read.json("file:///home/spark/sample.json")
dfs: org.apache.spark.sql.DataFrame = [employees: array<struct<firstName:string,lastName:string>>]

scala> dfs.printSchema
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
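For the PySpark session from the question, the same single-line file gives one nested array column, which can be flattened with explode. A minimal sketch, assuming employees.json now holds the whole document on one line (the output shown is what the data in the question should produce):

>>> from pyspark.sql.functions import explode
>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.select(explode("employees").alias("e")).select("e.firstName", "e.lastName").show()
+---------+--------+
|firstName|lastName|
+---------+--------+
|     John|     Doe|
|     Anna|   Smith|
|    Peter|   Jones|
+---------+--------+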
Created 12-10-2016 09:55 AM
Thanks. From the link below, I found the explanation:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
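So an alternative fix is to reformat the file as one self-contained object per line (the JSON Lines convention). With the data from the question, the file and read would look like this (a sketch, reusing the same path as above):

$ cat employees.json
{"firstName":"John", "lastName":"Doe"}
{"firstName":"Anna", "lastName":"Smith"}
{"firstName":"Peter", "lastName":"Jones"}

>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.show()
+---------+--------+
|firstName|lastName|
+---------+--------+
|     John|     Doe|
|     Anna|   Smith|
|    Peter|   Jones|
+---------+--------+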
Created 12-17-2016 08:00 AM
@soumyabrata kole I hope the answers in this thread helped you. Please upvote/accept the best answer so that others can get help from it.
Created 11-28-2017 06:53 AM
How can I dump the corrupted records to some location for future reference?
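One possible approach (a sketch, with hypothetical paths): read in PERMISSIVE mode so bad lines land in the corrupt-record column, then filter those rows out and write them somewhere for later inspection. Note that newer Spark versions require caching the DataFrame before running a query that references only _corrupt_record:

>>> df = spark.read \
...     .option("mode", "PERMISSIVE") \
...     .option("columnNameOfCorruptRecord", "_corrupt_record") \
...     .json("/path/to/input.json")  # hypothetical input path
>>> df.cache()  # needed on newer Spark before filtering on _corrupt_record alone
>>> bad = df.filter(df["_corrupt_record"].isNotNull()).select("_corrupt_record")
>>> bad.write.mode("overwrite").text("/path/to/corrupt_records")  # hypothetical output path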