JSON file input path for loading into Spark
Labels: Apache Spark
Created ‎05-16-2016 09:05 PM
Hi - I am trying to load my JSON file using Spark and cannot seem to do it correctly. I think the problem is the path at the end of this bit of Scala; the file is located on my sandbox in the /tmp folder. I've tried:
val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/rawpanda.json")
Any help would be great, thanks.
mark
Created ‎05-17-2016 12:58 AM
Not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following warning called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
I usually get something like the following when trying to use a multi-line file.
scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
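One way to work around this (a minimal sketch, not from the original thread; the sample data mirrors the employees file shown further down) is to rewrite a regular multi-line JSON file as JSON Lines - one self-contained object per line - before handing it to Spark. For example, with stdlib Python:

```python
import json

# A pretty-printed JSON array, the kind of file Spark's json reader
# would flag as a single _corrupt_record.
pretty = """[
  {"id": "1201", "name": "satish", "age": "25"},
  {"id": "1202", "name": "krishna", "age": "28"}
]"""

records = json.loads(pretty)

# JSON Lines: one complete object per line, which Spark expects.
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)
```

After a conversion like this, each line parses on its own and sqlContext.read.json should infer the schema instead of reporting _corrupt_record.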
That said, all seems to be working for me with a file like the following.
[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
As you can see from the two ways I read the JSON file below, both approaches work.
SQL context available as sqlContext.

scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

scala> df1.printSchema()
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)

scala> df1.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+

scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

scala> df2.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
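Before blaming the path, it can help to confirm the file really is line-delimited. A quick stdlib-Python check (a hypothetical helper, not part of the thread) that every non-empty line parses as standalone JSON:

```python
import json

def is_json_lines(text):
    """Return True if every non-empty line is a self-contained JSON value."""
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            json.loads(line)
        except ValueError:
            return False
    return True

# One object per line, like the employees.json above: fine for Spark.
good = '{"id": "1201", "name": "satish"}\n{"id": "1202", "name": "krishna"}'
# A pretty-printed object spanning several lines: Spark will see _corrupt_record.
bad = '{\n  "id": "1201",\n  "name": "satish"\n}'
print(is_json_lines(good), is_json_lines(bad))
```

If this check fails on your file, the problem is the format rather than the load path.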
Again, if this doesn't help feel free to share some more details. Good luck!
Created ‎05-18-2016 10:06 PM
Looks like the same question over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.... that @Joe Widen answered. Note my comment (and example) below, which Joe also pointed out, about the JSON object needing to be on a single line. Glad to see Joe got a "best answer" and I'd sure appreciate the same on this one. 😉
