Support Questions
Find answers, ask questions, and share your expertise

How to create Spark dataframe from python dictionary object?


Super Collaborator

Hi Guys,

I want to create a Spark dataframe from a Python dictionary, which will then be inserted into a Hive table. I have a dictionary like this:

event_dict={"event_ID": "MO1_B", "event_Name": "Model Consumption", "event_Type": "Begin"}

I tried creating an RDD and used hiveContext.read.json(rdd) to create a dataframe, but each row ends up containing a single character:

import json
json_rdd = sc.parallelize(json.dumps(event_dict))
event_df = hiveContext.read.json(json_rdd)
event_df.show()

The resulting dataframe has a single column with one character per row, like this:

{
"
e
...

I also tried hiveContext.createDataFrame(event_dict), but it gave me the same output.

Can you please suggest a way to do this? I want to avoid writing a JSON file to local disk/HDFS and reading it back. Thanks!

2 REPLIES

Re: How to create Spark dataframe from python dictionary object?

Super Collaborator

Just got this working: I created an RDD of the dictionary itself instead of a string, and it worked:

json_rdd=sc.parallelize([event_dict])

Is there any way to make the dataframe's columns follow the same insertion order as the dictionary's keys?

Re: How to create Spark dataframe from python dictionary object?

New Contributor

While converting a dict to a PySpark dataframe, the column values are getting interchanged. Why is this happening?

Here's my code:

>>> data
{u'reviewer_display_name': u'display_name9_text', u'rating_text': u'qualit\xe9', u'review_id': u'r4eview_id', u'reviewer_profile_photo_url': u'\u092c\u0939\u0941\u0924', u'review_text': u'qualit\xe9', u'oyo_id': u'oyo1_id', u'review_update_time': u'review_updat6e_time', u'review_create_time': u'r5eview_create_time', u'review_reply_time': u'review_reply_t8ime', u'review_reply': u'review_r7eply'}
>>> keys = data.keys()
>>> keys
[u'reviewer_display_name', u'rating_text', u'review_id', u'reviewer_profile_photo_url', u'review_text', u'oyo_id', u'review_update_time', u'review_create_time', u'review_reply_time', u'review_reply']
>>> tdf = sc.parallelize([data]).toDF(keys)
>>> tdf.show(truncate=False)
+---------------------+-----------+---------+--------------------------+-----------+------+------------------+------------------+-----------------+------------+
|reviewer_display_name|rating_text|review_id|reviewer_profile_photo_url|review_text|oyo_id|review_update_time|review_create_time|review_reply_time|review_reply|
+---------------------+-----------+---------+--------------------------+-----------+------+------------------+------------------+-----------------+------------+
|oyo1_id|qualité|r5eview_create_time|r4eview_id|review_r7eply|review_reply_t8ime|qualité|review_updat6e_time|display_name9_text|बहुत|
+---------------------+-----------+---------+--------------------------+-----------+------+------------------+------------------+-----------------+------------+