
Splitting a large Json Array file?

New Contributor

I have a very large JSON array (1 GB in size) that I want to write to Elasticsearch.

The format is [{doc1},{doc2},...,{doc100000}].

I tried the SplitJson processor but it causes an out-of-memory exception. I believe that's because the processor tries to create a separate flow file for each document, which means roughly 100,000 flow files at once.

Can this be done with the ConvertRecord processor? If so, would I just use a JsonPathReader with a JsonRecordSetWriter?

2 REPLIES

Re: Splitting a large Json Array file?

Super Guru

@Harry Yuen

When you use SplitJson, the processor splits out each document in the array as a new flow file.

Use the record-oriented PutElasticSearchHttpRecord processor (if it serves your purposes) and configure/enable a Record Reader controller service on it; the processor reads the flow file content with the reader and puts the records into Elasticsearch.

Then you don't need to split the JSON array at all.
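A rough idea of that setup (a minimal sketch, not exact settings: property names are from a typical NiFi 1.x install and may vary by version, and the URL and index name are placeholders for your environment):

PutElasticSearchHttpRecord
   Elasticsearch URL: http://your-es-host:9200      // placeholder, point at your cluster
   Record Reader: JsonTreeReader (or JsonPathReader) controller service
   Index: your-index                                // placeholder
   Type: _doc                                       // mapping type, depends on your ES version

The reader controller service only has to describe the shape of a single document in the array; the processor then iterates over the records itself, which is what removes the need for an up-front split.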

(or)

Use a series of SplitRecord processors to get down to a single document per flow file: configure a JsonPathReader with a JsonRecordSetWriter and set the Records Per Split property, so each processor breaks the content into chunks of that size.

Flow:

1. SplitRecord      // split the array into chunks of 100k records per flow file
2. SplitRecord      // split each chunk into single-record flow files
3. PutElasticSearch

By using a series of SplitRecord processors to get to a single document per flow file in stages, rather than in one pass, we mitigate the out-of-memory issue.
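A minimal sketch of those two SplitRecord stages (assuming stock NiFi property names; the reader and writer are the same JsonPathReader/JsonRecordSetWriter controller services as above, and 100,000 is just an example chunk size):

SplitRecord #1
   Record Reader: JsonPathReader
   Record Writer: JsonRecordSetWriter
   Records Per Split: 100000    // first pass: large chunks
SplitRecord #2
   Record Reader: JsonPathReader
   Record Writer: JsonRecordSetWriter
   Records Per Split: 1         // second pass: one document per flow file

Splitting in two passes means no single processor has to hold all ~100,000 result flow files in one session, which is what keeps the heap usage down.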

Re: Splitting a large Json Array file?

New Contributor

@Shu

I tried the PutElasticSearchHttpRecord processor with a JsonRecordSetWriter and I still get an out-of-memory exception. I will try your second suggestion and let you know how that goes.
