How to split large json file into multiple json files in Nifi?
Labels: Apache NiFi
Created 02-17-2023 09:47 AM
We have a large JSON file (more than 100 GB) that we want to split into multiple files. We used the SplitText processor to split it by specifying a Line Split Count. Is there any way to pass an attribute/variable into Line Split Count and split the records based on that attribute/variable? Currently, Line Split Count does not support attributes/variables.
Kindly suggest another approach to split these JSON files based on an attribute/variable.
Sample Json File
{"name": "John","lastName": "Wick","phoneNumber": "123123123"}
{"name": "Paul","lastName": "Jackson","phoneNumber": "123123123"}
{"name": "Paul","lastName": "Jackson","phoneNumber": "123123123"}
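For reference, SplitText with a fixed Line Split Count behaves roughly like the sketch below. This is plain Python emulating the split, not NiFi itself, and the chunk size parameter stands in for the attribute/variable we would like to pass:

```python
import json

# Newline-delimited JSON, as in the sample above
ndjson = """{"name": "John","lastName": "Wick","phoneNumber": "123123123"}
{"name": "Paul","lastName": "Jackson","phoneNumber": "123123123"}
{"name": "Paul","lastName": "Jackson","phoneNumber": "123123123"}"""

def split_lines(text, line_split_count):
    """Split newline-delimited JSON into chunks of at most
    line_split_count lines, mimicking SplitText's Line Split Count."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + line_split_count])
            for i in range(0, len(lines), line_split_count)]

# 3 records, 2 per chunk -> 2 output "files"
chunks = split_lines(ndjson, 2)
```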
Created 02-19-2023 08:16 AM
Hi,
Try looking into the QueryRecord or PartitionRecord processors. Those might help.
Thanks
Created 02-23-2023 08:37 AM
I have tried both QueryRecord and PartitionRecord, and neither fits this use case. Can the SplitRecord processor be used for this purpose? If yes, can you provide an example based on the above sample records?
Created 02-27-2023 09:25 AM
Yes, SplitRecord is what you should use.
Attached is a flow definition as an example. Note that I had to attach the file with a .txt extension; once you download it, rename it to a .json extension. You can then drag a process group onto the canvas, which gives you the option to upload the flow definition.
The example generates a file with 102 records. In SplitRecord we use a JsonTreeReader and split by 3 records, writing the FlowFiles out 3 records per FlowFile, generating 34 FlowFiles:
102 / 3 = 34
In your case, based on your screenshot, I would change the split count to 1500000 (or another number based on your needs).
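The arithmetic in that example flow can be checked with a quick sketch. This is plain Python emulating SplitRecord's per-split record count on 102 generated sample records, not NiFi itself:

```python
import math

# Generate 102 sample records, as in the example flow
records = [{"name": f"user{i}", "phoneNumber": "123123123"}
           for i in range(102)]

records_per_split = 3  # the per-split record count set on SplitRecord

# Each output FlowFile gets up to records_per_split records
flowfiles = [records[i:i + records_per_split]
             for i in range(0, len(records), records_per_split)]

# SplitRecord emits ceil(total / records_per_split) FlowFiles
print(len(flowfiles))  # 34
```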
Created 03-02-2023 09:42 AM
@rahul_loke Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks
Regards,
Diana Torres, Community Moderator