Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Loading a JSON File from URL into a Spark DataFrame with Python

Solved Go to solution
Highlighted

Loading a JSON File from URL into a Spark DataFrame with Python

New Contributor

I'm trying to load a JSON file from an URL into DataFrame. The data is loaded and parsed correctly into the Python JSON type but passing it as argument to sc.parallelize() throws an Exception:

The Code:

url = "http://api.luftdaten.info/static/v1/data.json"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
json_string = json.dumps(json_data)
rdd = sc.parallelize(json_string)
df = sqlContext.read.json(rdd)

The Error:

root 

|-- _corrupt_record: string (nullable = true)<br>

Anyone an Idea what is wrong?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Loading a JSON File from URL into a Spark DataFrame with Python

New Contributor

If someone else wanna know I've found something that is working for me

def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string


def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    rdd = sc.parallelize(mylist)
    df = sqlContext.jsonRDD(rdd)
    return df


url = "myurl.json"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
df = parse_dataframe(json_data)<br>
4 REPLIES 4

Re: Loading a JSON File from URL into a Spark DataFrame with Python

Guru

@Lukas Müller, try below way to create dataframes for data.json

import json
import requests

r = requests.get("http://api.luftdaten.info/static/v1/data.json")
df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines()])

Reference: https://stackoverflow.com/questions/32418829/using-pyspark-to-read-json-file-directly-from-a-website

Re: Loading a JSON File from URL into a Spark DataFrame with Python

New Contributor

Unfortunately this only works if the API returns a single json object per line. I reformatted the data into a string with line breaks and tried to apply this to the inline function. Still doesn't' work.

def convert_single_object_per_line(json_list):
json_string = "" for line in json_list:
json_string += json.dumps(line) + "\n" return json_string
df = sqlContext.createDataFrame([json.loads(line) for line in r.splitlines()])

Re: Loading a JSON File from URL into a Spark DataFrame with Python

New Contributor

Love to suggest JSON Editor , this wil helps to open or load files/ url , it will helps, to create, update and validate JSON data.

Re: Loading a JSON File from URL into a Spark DataFrame with Python

New Contributor

If someone else wanna know I've found something that is working for me

def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string


def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    rdd = sc.parallelize(mylist)
    df = sqlContext.jsonRDD(rdd)
    return df


url = "myurl.json"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
df = parse_dataframe(json_data)<br>
Don't have an account?
Coming from Hortonworks? Activate your account here