Support Questions

Parse nested json using Spark RDD

Seaport · 09-09-2021

I am trying to parse a nested JSON document using an RDD rather than a DataFrame. I cannot use a DataFrame (the typical approach, spark.read.json) because the document structure is very complicated: the schema inferred by the reader is useless because child nodes at the same level have different schemas. So I tried the script below.

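import json
# sc is the SparkContext provided by the PySpark shell
s = '{"key1":{"myid": "123","myname":"test"}}'
# parallelize expects a collection; a bare string would be split into single characters
rdd = sc.parallelize([s]).map(json.loads)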

My next step is to use a map transformation to work through the parsed JSON, but I do not know where to start. I tried the script below, but it failed.

rdd2=rdd.map(lambda j: (j[x]) for x in j)

I would appreciate any resources on using RDD transformations to parse JSON.

RangaReddy · 09-14-2021

Hi @Seaport

Please check the following example. It may help.

https://kontext.tech/column/spark/284/pyspark-convert-json-string-column-to-array-of-object-structty...
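
For the RDD route asked about in the question, here is a minimal sketch rather than a definitive solution. In the attempted rdd.map call, Python reads the argument as a generator expression of lambdas, so the trailing "for x in j" refers to a j that is not defined outside the lambda. One alternative is a plain recursive function that walks each parsed document and yields (path, value) pairs, which copes with sibling nodes that have different shapes. The name flatten and the dotted-path format are assumptions made for illustration; they do not come from the linked article.

```python
import json


def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf of a parsed JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            # Extend the dotted path with the current key
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            # Mark list positions with an index suffix
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        # Leaf value: string, number, boolean or None
        yield prefix, node


s = '{"key1":{"myid": "123","myname":"test"}}'
pairs = list(flatten(json.loads(s)))
print(pairs)  # [('key1.myid', '123'), ('key1.myname', 'test')]
```

Assuming sc is a live SparkContext, the same function could be applied per record with flatMap, for example sc.parallelize([s]).map(json.loads).flatMap(flatten), which should give an RDD of (path, value) pairs.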

Seaport · 09-15-2021

@RangaReddy

The link is exactly what I need. Thanks for your help.
