
JSON file input path for loading into Spark

New Contributor

Hi - I am trying to load my JSON file using Spark and cannot seem to do it correctly. The problem is the path at the end of this bit of Scala; the file is located in the tmp folder on my sandbox. I've tried:

val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/rawpanda.json")

Any help would be great, thanks.

Mark
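
One thing worth checking first (an assumption on my part, since the error message isn't shown): a path with no scheme, like /tmp/rawpanda.json, is resolved against Spark's default filesystem, which on the sandbox is HDFS rather than the local disk. If the file only exists in the local /tmp, an explicit scheme makes the intent unambiguous. A minimal sketch:

// A bare path like "/tmp/rawpanda.json" resolves against fs.defaultFS
// (HDFS on the sandbox), so state explicitly where the file lives.

// If rawpanda.json sits on the sandbox's local disk:
val dfLocal = sqlContext.read.json("file:///tmp/rawpanda.json")

// If it has been copied into HDFS (e.g. with hdfs dfs -put):
val dfHdfs = sqlContext.read.json("hdfs:///tmp/rawpanda.json")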

1 ACCEPTED SOLUTION


I'm not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following warning, called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

I usually get something like the following when trying to use a multi-line file.

scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
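
If the file really is one multi-line JSON document, a common Spark 1.x workaround (a sketch, assuming each file holds a single self-contained JSON object) is to read every file whole and hand the resulting strings to the JSON reader:

// Read each file as a single (path, content) record, then parse each file's
// entire content as one JSON object. Only suitable for files small enough
// to hold in memory as a single record.
val wholeFiles = sc.wholeTextFiles("/tmp/hcc/products.json").values
val productsDF = sqlContext.read.json(wholeFiles)

On Spark 2.2+ the JSON reader also has a built-in multiLine option for this, but that isn't available on the Spark 1.x sandbox shown here.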

That said, everything seems to work for me with a file like the following.

[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}

You can see this in the two ways I read the JSON file below.

SQL context available as sqlContext.
scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df1.printSchema()
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
scala> df1.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df2.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
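
As an aside, the samplingRatio option in that second read only controls what fraction of the input the reader scans while inferring the schema (1.0 means scan all of it); it does not change which rows are loaded.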

Again, if this doesn't help, feel free to share some more details. Good luck!





Looks like the same question over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.... that @Joe Widen answered. Note my comment (and example) above; Joe also pointed out that each JSON object needs to be on a single line. Glad to see Joe got a "best answer", and I'd sure be appreciative of the same on this one. 😉