Support Questions

Seaport · ‎09-09-2021

I am trying to parse a nested JSON document using an RDD rather than a DataFrame. I cannot use a DataFrame (the typical approach being spark.read.json) because the document structure is very complicated: the schema inferred by the reader is useless, since child nodes at the same level have different schemas. So I tried the script below.

import json
s = '{"key1":{"myid": "123","myname":"test"}}'
# Wrap the string in a list; parallelize(s) would distribute its individual characters.
rdd = sc.parallelize([s]).map(json.loads)

My next step is to use a map transformation to parse the JSON, but I do not know where to start. I tried the script below, but it failed.

rdd2=rdd.map(lambda j: (j[x]) for x in j)

I would appreciate any resources on using RDD transformations to parse JSON.

RangaReddy · ‎09-14-2021

Hi @Seaport

Please check the following example. It may help.

https://kontext.tech/column/spark/284/pyspark-convert-json-string-column-to-array-of-object-structty...
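
For readers who want to stay with plain RDD transformations, as the question asks, here is a minimal sketch. It is not taken from the linked article. The helper names parse_line, flatten and flatten_rdd are hypothetical, and sc is assumed to be an existing SparkContext. The idea is to parse each string with json.loads inside map, then use flatMap with a recursive generator so that every leaf value becomes one (path, value) record, whatever shape the sibling nodes have.

```python
import json


def parse_line(line):
    """Parse one JSON document; return None if the text is malformed."""
    try:
        return json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def flatten(node, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict/list."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from flatten(child, f"{prefix}[{i}]")
    else:
        yield prefix, node


def flatten_rdd(sc, json_strings):
    """Build an RDD of (path, value) pairs from a list of JSON strings.

    `sc` is assumed to be an existing SparkContext (hypothetical wiring).
    """
    return (
        sc.parallelize(json_strings)            # one RDD element per document
        .map(parse_line)                        # str -> dict/list, or None
        .filter(lambda doc: doc is not None)    # drop malformed documents
        .flatMap(flatten)                       # one record per leaf value
    )


# Pure-Python check of the flattening logic, no Spark needed:
s = '{"key1":{"myid": "123","myname":"test"}}'
print(list(flatten(parse_line(s))))
# [('key1.myid', '123'), ('key1.myname', 'test')]
```

With a live SparkContext, flatten_rdd(sc, [s]).collect() should return the same pairs. Note that sc.parallelize expects a collection of records, so the JSON string is passed inside a list. A bare string would be split into single characters.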

Seaport · ‎09-15-2021

@RangaReddy

The link is exactly what I need. Thanks for your help.

Cloudera Community

Parse nested json using Spark RDD