Query a REST API and load the data into a Spark DataFrame using PySpark

Hello,

I am building a data pipeline that consumes JSON data from a REST API and loads it into a Spark DataFrame. Spark version: 2.4.4

However, I am getting the following error:

df = SQLContext.jsonRDD(rdd) 
AttributeError: type object 'SQLContext' has no attribute 'jsonRDD'

 

Code :

 

from pyspark import SparkConf,SparkContext
from pyspark.sql import SparkSession
from urllib import urlopen
from pyspark import SQLContext
import json
spark = SparkSession \
    .builder \
    .appName("DataCleansing") \
    .getOrCreate()


def convert_single_object_per_line(json_list):
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

def parse_dataframe(json_data):
    r = convert_single_object_per_line(json_data)
    mylist = []
    for line in r.splitlines():
        mylist.append(line)
    rdd = spark.sparkContext.parallelize(mylist)
    df = SQLContext.jsonRDD(rdd)
    return df

url = "https://mylink"
response = urlopen(url)
data = str(response.read())
json_data = json.loads(data)
df = parse_dataframe(json_data)

 

Could someone please help? Is there a better way to query a REST API and load the data into a Spark DataFrame using PySpark?

 

If this is not possible in PySpark, can it be done in Scala? Please share your suggestions.
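
One possible direction, sketched under assumptions rather than confirmed against this setup: jsonRDD was removed in Spark 2.0, and spark.read.json accepts an RDD of JSON strings in its place. The helper names below (fetch_json, to_json_lines, json_lines_to_dataframe) are hypothetical, and the sketch assumes Python 3 and an endpoint that returns either a JSON array of objects or a single object.

```python
import json
from urllib.request import urlopen  # Python 3 location of urlopen


def fetch_json(url):
    """Download a URL and parse the response body as JSON (assumes UTF-8)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def to_json_lines(payload):
    """Return one JSON string per record; a single object becomes one record."""
    records = payload if isinstance(payload, list) else [payload]
    return [json.dumps(record) for record in records]


def json_lines_to_dataframe(spark, lines):
    """Build a DataFrame from JSON strings using an existing SparkSession."""
    rdd = spark.sparkContext.parallelize(lines)
    # DataFrameReader.json accepts an RDD of JSON strings as well as paths.
    return spark.read.json(rdd)
```

With the spark session and url from the code above, this could be called as df = json_lines_to_dataframe(spark, to_json_lines(fetch_json(url))). The download happens on the driver, so this only suits responses that fit in driver memory.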
