<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Can't read Json properly in Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147903#M48528</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2799/soumyabratakole.html" nodeid="2799"&gt;@soumyabrata kole &lt;/A&gt;I hope the answers in this thread helped you. Please upvote/accept the best answer so that others can benefit from it.&lt;/P&gt;</description>
    <pubDate>Sat, 17 Dec 2016 16:00:46 GMT</pubDate>
    <dc:creator>rajkumar_singh</dc:creator>
    <dc:date>2016-12-17T16:00:46Z</dc:date>
    <item>
      <title>Can't read Json properly in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147900#M48525</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;I am trying to read a valid JSON file, shown below, through Spark SQL.&lt;/P&gt;&lt;PRE&gt;
{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}&lt;/PRE&gt;&lt;P&gt;My code is below:&lt;/P&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; from pyspark.sql import SparkSession
&amp;gt;&amp;gt;&amp;gt; spark = SparkSession \
...     .builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
&amp;gt;&amp;gt;&amp;gt; df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
&amp;gt;&amp;gt;&amp;gt; df.show()                                                                  
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
&amp;gt;&amp;gt;&amp;gt; df.createOrReplaceTempView("employees")
&amp;gt;&amp;gt;&amp;gt; sqlDF = spark.sql("SELECT * FROM employees")
&amp;gt;&amp;gt;&amp;gt; sqlDF.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[|     null|    null|
|           null|     John|     Doe|
|           null|     Anna|   Smith|
|           null|    Peter|   Jones|
|             ]}|     null|    null|
+---------------+---------+--------+
&amp;gt;&amp;gt;&amp;gt; &lt;/PRE&gt;&lt;P&gt;As I understand it, there should be only two columns, firstName and lastName. Is my understanding wrong?&lt;/P&gt;&lt;P&gt;Why does the _corrupt_record column appear, and how can I avoid it?&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Soumya&lt;/P&gt;</description>
      <pubDate>Sat, 10 Dec 2016 15:18:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147900#M48525</guid>
      <dc:creator>soumyabrata_kol</dc:creator>
      <dc:date>2016-12-10T15:18:55Z</dc:date>
    </item>
    <item>
      <title>Re: Can't read Json properly in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147901#M48526</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2799/soumyabratakole.html" nodeid="2799"&gt;@soumyabrata kole&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This is a common problem with multi-line JSON documents: Spark reads the file line by line, so any line that is not a complete JSON object on its own is reported as a corrupt record.&lt;/P&gt;&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json" target="_blank"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If you put the whole JSON document on a single line, Spark will be able to infer the schema correctly.&lt;/P&gt;&lt;PRE&gt;[spark@rkk1 ~]$ cat sample.json
{"employees":[{"firstName":"John", "lastName":"Doe"},{"firstName":"Anna", "lastName":"Smith"},{"firstName":"Peter", "lastName":"Jones"}]}

scala&amp;gt; val dfs = spark.sqlContext.read.json("file:///home/spark/sample.json")
dfs: org.apache.spark.sql.DataFrame = [employees: array&amp;lt;struct&amp;lt;firstName:string,lastName:string&amp;gt;&amp;gt;]

scala&amp;gt; dfs.printSchema
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
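&lt;/PRE&gt;&lt;P&gt;If you would rather not flatten the file by hand, here is a minimal sketch in plain Python (standard library only) that rewrites a pretty-printed document as one JSON object per line, which is the layout spark.read.json expects. With one employee object per line, the resulting DataFrame should have just the firstName and lastName columns. The helper name to_json_lines, its array_key parameter, and the file names in the closing comment are made up for illustration. Spark 2.2 and later can also read the original file directly through the multiLine read option, if upgrading is possible.&lt;/P&gt;&lt;PRE&gt;
```python
import json


def to_json_lines(src_path, dst_path, array_key="employees"):
    """Rewrite a multi-line JSON document as one compact JSON object per line.

    Hypothetical helper: the name and the array_key default are illustrative.
    """
    with open(src_path, encoding="utf-8") as src:
        document = json.load(src)  # the standard parser accepts multi-line JSON
    records = document[array_key]
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # json.dumps without indent emits no newlines, so each record
            # occupies exactly one line of the output file.
            dst.write(json.dumps(record) + "\n")
    return len(records)


# Assumed usage, with illustrative file names:
#   to_json_lines("employees.json", "employees.jsonl")
#   df = spark.read.json("employees.jsonl")
```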



&lt;/PRE&gt;</description>
      <pubDate>Sat, 10 Dec 2016 16:45:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147901#M48526</guid>
      <dc:creator>rajkumar_singh</dc:creator>
      <dc:date>2016-12-10T16:45:20Z</dc:date>
    </item>
    <item>
      <title>Re: Can't read Json properly in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147902#M48527</link>
      <description>&lt;P&gt;Thanks. I found the explanation at the link below:&lt;/P&gt;&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets"&gt;http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets&lt;/A&gt;&lt;/P&gt;&lt;PRE&gt;Note that the file that is offered as &lt;EM&gt;a json file&lt;/EM&gt; is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.&lt;/PRE&gt;</description>
      <pubDate>Sat, 10 Dec 2016 17:55:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147902#M48527</guid>
      <dc:creator>soumyabrata_kol</dc:creator>
      <dc:date>2016-12-10T17:55:47Z</dc:date>
    </item>
    <item>
      <title>Re: Can't read Json properly in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147903#M48528</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2799/soumyabratakole.html" nodeid="2799"&gt;@soumyabrata kole &lt;/A&gt;I hope the answers in this thread helped you. Please upvote/accept the best answer so that others can benefit from it.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Dec 2016 16:00:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147903#M48528</guid>
      <dc:creator>rajkumar_singh</dc:creator>
      <dc:date>2016-12-17T16:00:46Z</dc:date>
    </item>
    <item>
      <title>Re: Can't read Json properly in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147904#M48529</link>
      <description>&lt;P&gt;How can I dump the corrupted records to some location for future reference?&lt;/P&gt;</description>
      <pubDate>Tue, 28 Nov 2017 14:53:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Can-t-read-Json-properly-in-Spark/m-p/147904#M48529</guid>
      <dc:creator>ankur_singh1000</dc:creator>
      <dc:date>2017-11-28T14:53:30Z</dc:date>
    </item>
  </channel>
</rss>

