json file input path for loading into spark

mesteph6 — Tue, 17 May 2016 04:05:07 GMT

Hi, I am trying to load my JSON file using Spark and cannot seem to do it correctly. My question is about the path at the end of this bit of Scala. The file is located on my sandbox in the tmp folder. I've tried:

val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/rawpanda.json")

Any help would be great, thanks.

mark

Re: json file input path for loading into spark

LesterMartin — Tue, 17 May 2016 07:58:11 GMT

I'm not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following note called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets.

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

I usually get something like the following when trying to use a multi-line file.

scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
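
If the file really is multi-line JSON, one possible workaround is to hand each whole file to the parser as a single string. The snippet below is only a sketch under a few assumptions: it runs inside spark-shell on Spark 1.6 (so `sc` and `sqlContext` already exist), the file is small enough to fit in memory as one string, and it holds a single JSON document. How a top-level array is turned into rows may depend on the Spark version, so check the resulting schema. The names `wholeFiles`, `jsonDocs` and `productsDF` are made up for illustration.

```scala
// Run inside spark-shell (Spark 1.6), where `sc` and `sqlContext` are predefined.

// Read each file as one (path, content) pair, so a JSON document that
// spans several lines reaches the parser in one piece.
val wholeFiles = sc.wholeTextFiles("/tmp/hcc/products.json")

// Keep only the file contents; each element is now one complete JSON document.
val jsonDocs = wholeFiles.values

// DataFrameReader.json also accepts an RDD[String], one JSON document per element.
val productsDF = sqlContext.read.json(jsonDocs)

// If the parse worked, the schema shows real fields instead of _corrupt_record.
productsDF.printSchema()
```

On Spark 2.2 and later, the JSON reader has a `multiLine` option that covers the same case without going through `wholeTextFiles`.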

That said, all seems to be working for me with a file like the following.

[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}

As you can see below, both ways of reading the JSON file work.

SQL context available as sqlContext.
scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df1.printSchema()
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
scala> df1.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df2.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
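
Since the question is about the input path, it may also be worth checking which filesystem the path resolves to. A path with no scheme is resolved against the default filesystem, which on the sandbox is presumably HDFS (my file above was listed with `hdfs dfs -cat`), so a file sitting in the sandbox's local tmp folder would not be found there. The sketch below is only a way to probe this, assuming spark-shell with `sqlContext` predefined; the `candidates` list and the printed messages are made up for illustration.

```scala
// Run inside spark-shell, where `sqlContext` is predefined.
import scala.util.{Failure, Success, Try}

// The same file name with no scheme, an explicit HDFS scheme,
// and an explicit local-filesystem scheme.
val candidates = Seq(
  "/tmp/rawpanda.json",        // resolved against the default filesystem
  "hdfs:///tmp/rawpanda.json", // HDFS, using the configured NameNode
  "file:///tmp/rawpanda.json"  // local disk of the machine running Spark
)

// Try each location and report whether it could be read and what schema was inferred.
candidates.foreach { path =>
  Try(sqlContext.read.json(path)) match {
    case Success(df) => println(s"$path -> ${df.schema.simpleString}")
    case Failure(e)  => println(s"$path -> could not be read: ${e.getMessage}")
  }
}
```

If only the `file:///` variant works, copying the file into HDFS with `hdfs dfs -put /tmp/rawpanda.json /tmp/` should let the original scheme-less path work as well. Keep in mind that on a multi-node cluster a `file:///` path has to exist on every worker, so this mostly suits a single-node sandbox.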

Again, if this doesn't help, feel free to share some more details. Good luck!

Re: json file input path for loading into spark

LesterMartin — Thu, 19 May 2016 05:06:49 GMT

Looks like the same question was asked over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.html, which @Joe Widen answered. Note that Joe also pointed out what my comment (and example) covers: each JSON object needs to be on a single line. Glad to see Joe got a "best answer", and I'd sure be appreciative of the same on this one. 😉
