Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Loading a JSON File from URL into a Spark DataFrame with Python


I'm trying to load a JSON file from a URL into a DataFrame. The data is loaded and parsed correctly into a Python object by json.loads(), but after passing it as an argument to sc.parallelize() the resulting DataFrame contains only a corrupt-record column:

The Code:

url = "http://api.luftdaten.info/static/v1/data.json"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
json_string = json.dumps(json_data)
rdd = sc.parallelize(json_string)
df = sqlContext.read.json(rdd)

The Error:

root
 |-- _corrupt_record: string (nullable = true)

Does anyone have an idea what is wrong?
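A likely cause, sketched here without Spark: sc.parallelize() distributes the elements of whatever collection it is given, and iterating a Python string yields single characters. Under that assumption, every RDD element would be a one-character string rather than a JSON record, which would explain the _corrupt_record column. The sample records below are hypothetical stand-ins for the API response.

```python
import json

# Hypothetical sample standing in for the parsed API response (a list of records).
json_data = [{"id": 1, "value": "12.5"}, {"id": 2, "value": "13.0"}]

# This mirrors the question: the whole list is serialized into ONE string.
json_string = json.dumps(json_data)

# Iterating a str yields its characters, so a collection built from it
# has one element per character, not one element per record.
elements = list(json_string)

print(len(json_data))   # number of records
print(len(elements))    # number of characters in the serialized string
print(elements[:3])     # the first few single-character elements
```

Passing a list of strings, one complete JSON object per string, avoids this.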

1 ACCEPTED SOLUTION


In case anyone else wants to know, I've found something that works for me:

def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string


def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    rdd = sc.parallelize(mylist)
    # jsonRDD() was deprecated in Spark 1.4; sqlContext.read.json(rdd) is the replacement
    df = sqlContext.jsonRDD(rdd)
    return df


url = "myurl.json"
response = urlopen(url)
data = response.read().decode("utf-8")  # decode explicitly; str() on bytes gives "b'...'" in Python 3
json_data = json.loads(data)
df = parse_dataframe(json_data)
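The conversion step above can also be sketched more compactly. This is only an illustration of the same idea (one JSON string per record), not a replacement for the posted code; the helper name to_json_lines and the sample records are hypothetical, and the Spark call is left as a comment because it needs a live sc and sqlContext.

```python
import json


def to_json_lines(records):
    """Serialize each record of a parsed JSON array as its own JSON string."""
    return [json.dumps(record) for record in records]


# Hypothetical sample standing in for json.loads(data) on the API response.
json_data = [{"id": 1, "value": "12.5"}, {"id": 2, "value": "13.0"}]

lines = to_json_lines(json_data)

# Each element is one complete JSON document, which is what the JSON reader expects.
for line in lines:
    print(line)

# With an existing SparkContext `sc` and SQLContext `sqlContext` (not created here):
# df = sqlContext.read.json(sc.parallelize(lines))
```

Building the list directly skips the intermediate newline-joined string and the splitlines() pass.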


4 REPLIES

Guru

@Lukas Müller, try the approach below to create a DataFrame from data.json:

import json
import requests

r = requests.get("http://api.luftdaten.info/static/v1/data.json")
df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines()])

Reference: https://stackoverflow.com/questions/32418829/using-pyspark-to-read-json-file-directly-from-a-website


Unfortunately this only works if the API returns a single JSON object per line. I reformatted the data into a string with line breaks and tried to apply the inline list comprehension to it. It still doesn't work.

def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

df = sqlContext.createDataFrame([json.loads(line) for line in r.splitlines()])
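To illustrate the "single JSON object per line" limitation without Spark or a network call, here is a small sketch. Both response bodies are made-up examples; the real layout of the API response is an assumption (a single JSON array, possibly spread over several lines).

```python
import json

# Made-up bodies: one in JSON Lines layout, one as a single pretty-printed array.
json_lines_body = '{"id": 1}\n{"id": 2}'
array_body = '[\n  {"id": 1},\n  {"id": 2}\n]'


def parse_per_line(body):
    """Parse every line of the body as an independent JSON document."""
    return [json.loads(line) for line in body.splitlines()]


# Works: every line is a complete JSON object.
rows = parse_per_line(json_lines_body)

# Fails: the first line of the array body is just "[", which is not valid JSON.
try:
    parse_per_line(array_body)
    per_line_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    per_line_ok = False

# Parsing the array body as a whole gives the list of records directly.
records = json.loads(array_body)

print(rows, per_line_ok, records)
```

When the body is a single array, parsing it as a whole and then handling each record separately is the safer route.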

New Member

I'd suggest JSON Editor. It can open or load files and URLs, and it helps to create, update and validate JSON data.
