Created 09-09-2021 01:18 AM
I am trying to parse a nested JSON document using an RDD rather than a DataFrame. I cannot use a DataFrame (the typical approach, e.g. spark.read.json) because the document structure is very complicated: child nodes at the same level have different schemas, so the schema inferred by the reader is useless. So I tried the script below.
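import json
s = '{"key1":{"myid": "123","myname":"test"}}'
# Wrap the string in a list; parallelize(s) on a bare string would split it into single characters
rdd = sc.parallelize([s]).map(json.loads)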
My next step is to use a map transformation to parse the JSON, but I do not know where to start. I tried the script below, but it failed.
rdd2=rdd.map(lambda j: (j[x]) for x in j)
I would appreciate any resources on using RDD transformations to parse JSON.
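One possible direction, offered as a minimal sketch rather than a definitive answer: since json.loads already turns each record into nested Python dicts and lists, the remaining work can be an ordinary recursive function passed to flatMap. The helper names flatten and parse_with_rdd below are hypothetical (they are not part of PySpark), and the dotted-path output format is an assumption about what "parsing" should produce here. The function is plain Python, so it can be checked locally before running it on a cluster.

```python
import json

s = '{"key1":{"myid": "123","myname":"test"}}'


def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested JSON value.

    Hypothetical helper: handles dicts and lists of any depth, so sibling
    nodes with different schemas need no special treatment.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from flatten(child, f"{prefix}[{i}]")
    else:
        yield prefix, node


def parse_with_rdd(sc, documents):
    """Hypothetical wrapper: documents is a list of JSON strings, sc a SparkContext."""
    # Each list element becomes one RDD record, which json.loads turns into a dict.
    rdd = sc.parallelize(documents).map(json.loads)
    # flatMap emits one output record per leaf instead of one per document.
    return rdd.flatMap(flatten)


# Local check without Spark: the same function works on a single parsed document.
leaves = list(flatten(json.loads(s)))
print(leaves)
```

With a live SparkContext, parse_with_rdd(sc, [s]).collect() should give the same pairs as the local check. Using flatMap instead of map is a design choice: map would return one list per document, while flatMap yields a flat RDD of (path, value) records that can be filtered or grouped by path afterwards.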
Created 09-14-2021 10:46 PM
Created 09-15-2021 10:32 PM