Created 08-14-2017 11:10 AM
I'm trying to load a JSON file from a URL into a DataFrame. The data is loaded and parsed correctly into Python JSON objects, but passing it as an argument to sc.parallelize() throws an exception:
The Code:
url = "http://api.luftdaten.info/static/v1/data.json"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
json_string = json.dumps(json_data)
rdd = sc.parallelize(json_string)
df = sqlContext.read.json(rdd)
The Error:
root
 |-- _corrupt_record: string (nullable = true)
Does anyone have an idea what is wrong?
Created 08-15-2017 12:18 PM
In case anyone else wants to know, I've found something that works for me:
def convert_single_object_per_line(json_list):
    # Serialize each element of the parsed list as one JSON object per line
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    # Each RDD element is now one complete JSON object
    rdd = sc.parallelize(mylist)
    df = sqlContext.jsonRDD(rdd)
    return df

url = "myurl.json"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
df = parse_dataframe(json_data)
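For anyone comparing this with the code in the question: a likely reason for the _corrupt_record result is that sc.parallelize() was given one big string, and iterating over a Python string yields single characters rather than JSON objects. The following is only a rough sketch of the same idea in a more compact form. The sample records are invented, to_json_lines and to_dataframe are hypothetical helper names, and the Spark part is wrapped in a function that is not called here because it needs a running sc and sqlContext.

```python
import json


def to_json_lines(records):
    """Serialize each parsed record to its own JSON string."""
    return [json.dumps(record) for record in records]


def to_dataframe(sc, sqlContext, records):
    """Hypothetical helper: needs a live SparkContext and SQLContext."""
    rdd = sc.parallelize(to_json_lines(records))
    # read.json accepts an RDD of JSON strings; jsonRDD is deprecated
    # in newer Spark versions, so read.json is assumed here instead.
    return sqlContext.read.json(rdd)


# Invented sample data standing in for the parsed API response
sample = [{"id": 1, "value": "12.5"}, {"id": 2, "value": "7.1"}]

# Iterating over one JSON string gives characters, not records
whole = json.dumps(sample)
chars = list(whole)

# One complete JSON object per list element
lines = to_json_lines(sample)
```

With a Spark session available, to_dataframe(sc, sqlContext, json_data) would take the place of parse_dataframe(json_data) above.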
Created 08-14-2017 07:36 PM
@Lukas Müller, try the approach below to create a DataFrame from data.json:
import json
import requests
r = requests.get("http://api.luftdaten.info/static/v1/data.json")
df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines()])
Reference: https://stackoverflow.com/questions/32418829/using-pyspark-to-read-json-file-directly-from-a-website
Created 08-15-2017 09:04 AM
Unfortunately this only works if the API returns a single JSON object per line. I reformatted the data into a string with line breaks and tried to apply it to the inline expression. It still doesn't work.
def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string
df = sqlContext.createDataFrame([json.loads(line) for line in r.splitlines()])
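To illustrate the point about one object per line, here is a small sketch using an invented response body. It assumes the API returns a single JSON array on one line, which this thread does not confirm. In that case, parsing line by line produces one element holding the whole array, whereas parsing the body once gives the individual records.

```python
import json

# Invented response body: a JSON array on a single line (an assumption)
body = '[{"id": 1, "value": "12.5"}, {"id": 2, "value": "7.1"}]'

# Line-by-line parsing yields one element: the whole array
per_line = [json.loads(line) for line in body.splitlines()]

# Parsing the body once yields the list of records directly
records = json.loads(body)

# With a running Spark session, the records could then be passed on, e.g.
# df = sqlContext.createDataFrame(records)
# (not executed here; schema inference on nested records may need extra care)
```

Here per_line has a single entry that is itself a list, while records holds one dict per measurement.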
Created 08-14-2017 08:03 PM
I'd suggest JSON Editor. It can open or load files and URLs, and it helps to create, update and validate JSON data.