Putting SequenceFile key value into a data frame

Seaport · ‎04-20-2022

I saved thousands of small JSON files in SequenceFile format to work around the "small files" problem. I use the following PySpark code to parse the JSON data from the saved sequence files.

reader = sc.sequenceFile("/mysequencefile_dir", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
rdd = reader.map(lambda x: x[1])
mydf = spark.read.schema(myschema).json(rdd)
mydf.show(truncate=False)

The code works. However, I do not know how to put the key of each sequence file record, which is the original JSON file name, into the mydf DataFrame. Please advise. Thank you.

Regards,

Seaport · ‎04-20-2022

I found a workaround: injecting a myfilepath element into the JSON string.

rdd = reader.map(lambda x: str(x[1])[0] + '"myfilepath":"' + x[0] + '",' + str(x[1])[1:])
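
Here is a small check in plain Python (no Spark needed) of what this splicing produces. The names inject_path and is_valid_json, and the sample file names and values, are made up for illustration; the expression inside inject_path is the same one as in the lambda above.

```python
import json


def inject_path(key, value):
    # Same string splicing as the workaround: insert the field right after "{"
    return str(value)[0] + '"myfilepath":"' + key + '",' + str(value)[1:]


def is_valid_json(text):
    # True if the text parses as JSON, False otherwise
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Ordinary object: the spliced result is still valid JSON
print(is_valid_json(inject_path("a.json", '{"x": 1}')))      # prints True
# Empty object: the trailing comma makes the result invalid
print(is_valid_json(inject_path("b.json", "{}")))            # prints False
# A double quote in the file name is not escaped, so the result is invalid
print(is_valid_json(inject_path('c"1.json', '{"x": 1}')))    # prints False
```

So it seems to hold up for ordinary records but depends on every value starting with "{" and having at least one field, and on the file name needing no escaping.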

It does not look like a very clean solution. Is there a better one? Thanks.

Regards

araujo · ‎04-20-2022

@Seaport ,

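Please try the following:

import json

def jsonize(k, v):
    # k is the SequenceFile key (the original file name), v is the JSON text
    ret = json.loads(v)
    ret.update({'key': k})  # add the file name as an extra field
    return ret

...
rdd = reader.map(lambda x: jsonize(*x))
...

Make sure your schema includes the added key column; with an explicit schema, fields that are not declared in it are left out of the DataFrame.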
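
For what it is worth, here is a minimal sketch of how the pieces might fit together. It rests on a few assumptions that are not from the thread: the helper names with_key and load_with_key are made up, the added field is called key to match the snippet above, and the helper returns a JSON string (via json.dumps) rather than a dict, on the assumption that spark.read.json behaves most predictably when the RDD holds JSON strings. The schema passed in is assumed to already contain a string field named key.

```python
import json


def with_key(k, v, field="key"):
    """Return the JSON text v with the SequenceFile key k added as a field."""
    record = json.loads(v)   # assumes each value is a single JSON object
    record[field] = k        # the key is the original file name
    return json.dumps(record)


def load_with_key(spark, path, schema):
    """Hypothetical wrapper: read the sequence file and parse the enriched JSON."""
    reader = spark.sparkContext.sequenceFile(
        path, "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text"
    )
    rdd = reader.map(lambda kv: with_key(*kv))
    # schema must declare a string field named "key", or the column is dropped
    return spark.read.schema(schema).json(rdd)
```

With a SparkSession named spark, one way to extend the existing schema would be StructType(myschema.fields + [StructField("key", StringType(), True)]), which could then be passed as load_with_key(spark, "/mysequencefile_dir", extended_schema).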

Cheers,

André


Seaport · ‎04-21-2022

André,

Thanks for the elegant solution.

Regards,
