
Split JSON issues

Super Collaborator


I have a huge nested-array JSON file (about 1 GB) that I need to flatten. I am using SplitJson as my first processor and it fails with out-of-memory errors, even though I have 16 GB reserved for the Java heap. I don't know why it errors out.

I'm sure the file has more than a million JSON records. How do I achieve this?

For large text files I have used SplitText, done my processing, and merged before pushing to the destination. It looks like SplitJson waits until it has split the whole file.

How do I achieve that functionality with JSON files? If I use SplitContent or SplitText, it will mess up the JSON format.
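(For anyone hitting the same wall: the reason SplitJson runs out of heap is that it parses the whole document before emitting any splits. A streaming approach decodes one top-level array element at a time instead. Here is a minimal Python sketch of that idea; `iter_array_items` is a hypothetical helper for illustration, not a NiFi API, and a real version would stream from disk rather than hold the text in a string.)

```python
import json

def iter_array_items(text):
    """Yield top-level items of a JSON array one at a time.

    Each element is decoded independently with raw_decode, so only
    one record's worth of parsed objects exists at a time -- the
    opposite of SplitJson, which materializes the whole document.
    """
    dec = json.JSONDecoder()
    i = text.index('[') + 1          # step past the opening bracket
    while True:
        # skip whitespace and commas between elements
        while i < len(text) and text[i] in ' \t\r\n,':
            i += 1
        if i >= len(text) or text[i] == ']':
            return                   # end of the array
        obj, end = dec.raw_decode(text, i)
        yield obj
        i = end

sample = '[{"Review_Id": 1}, {"Review_Id": 2}, {"Review_Id": 3}]'
ids = [r["Review_Id"] for r in iter_array_items(sample)]
```

Batching the yielded records into groups (say, 10K per output file) gives the same result as SplitRecord without ever holding the full parse tree in memory.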

Here is what my test JSON of two records looks like:

[{"Review_Id":111111111,"Brand_Id":"test","Product_Ds":"testprod1","Email_Id":"","Customer_Id":690,"Rating_No":5,"Recommend_Fg":true,"Review_Nm":"Great tasting!","Review_Ds":"I feed this to the picky dogs at my kennel. They love it!","ReviewStatus_Cd":"A","Review_Dt":"2015-05-01T17:37:28","Reviews_Answers":[{"Answer_Id":655108,"Review_Id":119205458,"Question_Ds":"Age","Answer_Ds":"35to44","Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655109,"Review_Id":119205458,"Question_Ds":"Employee","Answer_Ds":"No","Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655110,"Review_Id":119205458,"Question_Ds":"Taste my pet enjoys","Rating_No":5,"Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655111,"Review_Id":119205458,"Question_Ds":"Gender","Answer_Ds":"Female","Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655112,"Review_Id":119205458,"Question_Ds":"Number of dogs","Answer_Ds":"3","Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655113,"Review_Id":119205458,"Question_Ds":"Quality","Rating_No":5,"Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655114,"Review_Id":119205458,"Question_Ds":"Sample Product","Answer_Ds":"No","Created_Dt":"2017-04-04T14:29:28"},{"Answer_Id":655115,"Review_Id":119205458,"Question_Ds":"Value of Product","Rating_No":5,"Created_Dt":"2017-04-04T14:29:28"}]}, {"Review_Id":222222222,"Brand_Id":"test2","Product_Ds":"testprod2","Email_Id":"","Customer_Id":831,"Rating_No":5,"Recommend_Fg":true,"Review_Nm":"My dogs love the tender and crunchy pieces.","Review_Ds":"I have been buying this dog food for quite sometime, I have large and very small dogs and the size of the food fits for them all. They love it.","ReviewStatus_Cd":"A","Review_Dt":"2017-06-27T09:45:19","Reviews_Answers":[{"Answer_Id":1276571,"Review_Id":181705560,"Question_Ds":"*Received free food and\/or goods","Answer_Ds":"Yes","Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276572,"Review_Id":181705560,"Question_Ds":"Food purchased","Answer_Ds":"5","Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276573,"Review_Id":181705560,"Question_Ds":"Number of dogs","Answer_Ds":"5OrMore","Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276574,"Review_Id":181705560,"Question_Ds":"*Entered as part of a promotion","Answer_Ds":"False","Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276575,"Review_Id":181705560,"Question_Ds":"Would like to receive emails from","Answer_Ds":"Yes","Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276576,"Review_Id":181705560,"Question_Ds":"Employee","Answer_Ds":"No","Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276577,"Review_Id":181705560,"Question_Ds":"Quality","Rating_No":5,"Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276578,"Review_Id":181705560,"Question_Ds":"Taste my pet enjoys","Rating_No":5,"Created_Dt":"2017-07-04T10:34:34"},{"Answer_Id":1276579,"Review_Id":181705560,"Question_Ds":"Value for the money","Rating_No":5,"Created_Dt":"2017-07-04T10:34:34"}]}]


Super Guru

What processing are you doing to each record? You may be able to use a record-aware processor such as UpdateRecord and avoid splitting the JSON altogether.

Super Collaborator

@Matt Burgess,

I was able to use SplitRecord to split the huge JSON file into multiple files with 10K records each.

The reason I need SplitJson is that I have nested JSON. In the example below there are six nested arrays (brnd, qa, adr, ibcn, cq, ofr) for one customer. The only way I know to get those is by doing SplitJson --> EvaluateJsonPath six times, each time extracting the values of the elements at that level, taking the nested array JSON, and feeding it into the next SplitJson, and so on. I have attached my flow, but if there is a better way, please let me know.

[
  {
    "Customer_Id": 1111111,
    "brnd": [
      {
        "Brand_Nm": "Test",
        "qa": [
          {
            "Assignment_Id": 1116211,
            "Assign_Dt": null,
            "adr": [
              {
                "AddressLine_1": null,
                "AddressLine_2": null,
                "City": null,
                "State_Cd": null,
                "Postal_Cd": "11111 ",
                "ibcn": [
                  {
                    "BarCode_No": "162117",
                    "cq": [
                      {
                        "Vendor_Desc": "Coupons Inc",
                        "ofr": [
                          {
                            "Offer_Nm": "General_5DollarDryCatFood_EM_2016",
                            "Offer_Ds": "On (1) bag of Purina Beyond brand dry cat food, any size, any variety.",
                            "Offer_Expire_Dt": "2017-12-31T00:00:00",
                            "offer_channel_desc": "EM OFFER",
                            "SourceFeed_Ds": "A"
                          }
                        ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]
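(The chain of six SplitJson/EvaluateJsonPath steps described above can also be expressed as one recursive descent: expand one array level at a time, copying the parent's scalar fields onto every child row. Here is a minimal Python sketch of that idea, outside NiFi; `flatten` and `ARRAY_KEYS` are hypothetical names for illustration, with the nesting order taken from the sample JSON above.)

```python
# Nesting order observed in the sample: brnd > qa > adr > ibcn > cq > ofr
ARRAY_KEYS = ["brnd", "qa", "adr", "ibcn", "cq", "ofr"]

def flatten(record, keys):
    """Expand one nested-array level at a time, merging the parent's
    scalar fields into every child row -- the same effect as chaining
    SplitJson -> EvaluateJsonPath once per nesting level."""
    if not keys:
        yield dict(record)
        return
    key, rest = keys[0], keys[1:]
    scalars = {k: v for k, v in record.items() if k != key}
    children = record.get(key) or []
    if not children:                       # level absent: keep descending
        yield from flatten(scalars, rest)
        return
    for child in children:
        yield from flatten({**scalars, **child}, rest)

# Simplified version of the customer record from the sample above
customer = {
    "Customer_Id": 1111111,
    "brnd": [{"Brand_Nm": "Test",
              "qa": [{"Assignment_Id": 1116211,
                      "adr": [{"Postal_Cd": "11111",
                               "ibcn": [{"BarCode_No": "162117",
                                         "cq": [{"Vendor_Desc": "Coupons Inc",
                                                 "ofr": [{"Offer_Nm": "General_5DollarDryCatFood_EM_2016"}]}]}]}]}]}]}

rows = list(flatten(customer, ARRAY_KEYS))
```

Each output row is one fully flattened record (one per deepest `ofr` entry), so a record with several offers per barcode would fan out into several rows automatically instead of requiring a separate flow branch per level.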

Super Collaborator

@Matt Burgess, here are the screenshots:




Super Guru

Are there any errors or anything else in the logs?

How much RAM does the machine have?
