Created 05-16-2016 09:05 PM
hi - I am trying to load my JSON file using Spark and cannot seem to do it correctly. The path at the end of this bit of Scala points to the file, which is located in the /tmp folder on my sandbox. I've tried:
val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/rawpanda.json")
Any help would be great, thanks.
Mark
Created 05-17-2016 12:58 AM
Not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following warning called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
I usually get something like the following when trying to use a multi-line file.
scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
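If your file is a pretty-printed, multi-line JSON document, one workaround (my own rough sketch, not a built-in Spark feature) is to flatten each top-level object onto its own line before handing the file to Spark. The `FlattenJson` object and `toJsonLines` method below are names I made up for illustration, and the naive brace-depth scan assumes no `{` or `}` characters appear inside string values:

```scala
// Sketch only: flatten a pretty-printed JSON array into line-delimited
// JSON so sqlContext.read.json can parse it. Assumes no braces occur
// inside string values (a real converter would use a JSON library).
object FlattenJson {
  def toJsonLines(text: String): Seq[String] = {
    val out = scala.collection.mutable.ListBuffer[String]()
    val sb = new StringBuilder
    var depth = 0
    for (c <- text) {
      if (c == '{') { depth += 1; sb.append(c) }
      else if (c == '}') {
        depth -= 1; sb.append(c)
        // a top-level object just closed: emit it as one line
        if (depth == 0) { out += sb.toString; sb.clear() }
      }
      // keep characters inside an object, dropping the newlines;
      // text between objects ('[', ']', commas) is discarded
      else if (depth > 0 && c != '\n') sb.append(c)
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val multiLine =
      """[
        |  { "id": "1201",
        |    "name": "satish" },
        |  { "id": "1202",
        |    "name": "krishna" }
        |]""".stripMargin
    // each object now sits on one line, ready for sqlContext.read.json
    FlattenJson.toJsonLines(multiLine).foreach(println)
  }
}
```

You could write the flattened lines back to HDFS and then point `sqlContext.read.json` at the result.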
That said, all seems to be working for me with a file like the following.
[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
As you can see below, I read the JSON file two different ways.
SQL context available as sqlContext.

scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

scala> df1.printSchema()
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)

scala> df1.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+

scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

scala> df2.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
Again, if this doesn't help feel free to share some more details. Good luck!
Created 05-18-2016 10:06 PM
Looks like the same question over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.... that @Joe Widen answered. Note my comment (and example) below, which Joe also pointed out, about each JSON object needing to be on a single line. Glad to see Joe got a "best answer", and I'd sure be appreciative of the same on this one. 😉