- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
How to convert multiple related rows in CSV into nested JSON using Nifi?
- Labels:
-
Apache NiFi
Created on ‎01-16-2022 10:52 AM - edited ‎01-16-2022 11:01 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have the following CSV:
mobile | username | bill_id | bill_date | product_id | product_price |
9876543210 | John | AB10 | 2020-12-23 | X34 | 500 |
9876543210 | John | AB10 | 2020-12-23 | X35 | 230 |
9876543210 | John | AB22 | 2020-11-14 | X89 | 700 |
9999999999 | Alice | AC54 | 2019-12-20 | X10 | 109 |
And I want to convert them to the following structure:
{
"users": [
{
"mobile": "9876543210",
"username": "John",
"bills": [
{
"bill_id": "AB10",
"bill_date": "2020-12-23",
"products": [{
"product_id": "X34",
"product_price": "500",
},
{
"product_id": "X35",
"product_price": "230",
}]
},
{
"bill_id": "AB22",
"bill_date": "2020-11-14",
"products": [
{
"product_id": "X89",
"product_price": "700",
}]
}]
},
{
"mobile": "9999999999",
"username": "Alice",
"bills": [
{
"bill_id": "AC54",
"bill_date": "2019-12-20",
"products": [
{
"product_id": "X10",
"product_price": "109",
}]
}]
}]
}
I wasn't able to figure out how to group the related properties (e.g. product properties with the same bill_id) into an array. How should I go about doing this?
Created on ‎01-16-2022 09:48 PM - edited ‎01-16-2022 09:50 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your question.
We can try following steps to make it work as expected.
1. Convert csv to json ( you may use ConvertRecord ), to get JSON as below
[{
"mobile": "9876543210",
"username": "John",
"bill_id": "AB10",
"bill_date": "2020-12-23",
"product_id": "X34",
"product_price": "500"
}, {
"mobile": "9876543210",
"username": "John",
"bill_id": "AB10",
"bill_date": "2020-12-23",
"product_id": "X35",
"product_price": "230"
}, {
"mobile": "9876543210",
"username": "John",
"bill_id": "AB22",
"bill_date": "2020-11-14",
"product_id": "X89",
"product_price": "700"
}, {
"mobile": "9999999999",
"username": "Alice",
"bill_id": "AC54",
"bill_date": "2019-12-20",
"product_id": "X10",
"product_price": "109"
}]
2. We will then use JoltTransformJSON to make the shift stuff, the specification of such JOLT is listed as:
[ {
"operation" : "shift",
"spec" : {
"*" : {
"*" : "@(1,username)@(1,mobile)@(1,bill_id)@(1,bill_date)@(1,product_id).&"
}
}
}, {
"operation" : "shift",
"spec" : {
"*" : {
"$" : "users.[#2].username",
"*" : {
"$" : "users.[#3].mobile",
"*" : {
"*" : {
"$" : "users.[#5].bills.[#3].bill_date",
"$1" : "users.[#5].bills.[#3].bill_id",
"*" : {
"product_*" : "users.[#6].bills.[#4].products.[#2].&"
}
}
}
}
}
}
} ]
NOTE:
- For the JOLT spec, we gonna group the raw data into organic fields, which means we will get the data which are grouped by "username + mobile + bill_id +bill_date + product_id " fields.
- After that, we get the phase-1 shifted data that are ordered by unique keys.
- For the phase-2 shift, we gonna list out the data as predefined logic which is inferred from your attached sample output.
3. For details on JOLT spec, you may refer to the following link:
https://jolt-demo.appspot.com/
Hope this helps.
Thanks
Oliver Gong
Created on ‎01-16-2022 09:48 PM - edited ‎01-16-2022 09:50 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your question.
We can try following steps to make it work as expected.
1. Convert csv to json ( you may use ConvertRecord ), to get JSON as below
[{
"mobile": "9876543210",
"username": "John",
"bill_id": "AB10",
"bill_date": "2020-12-23",
"product_id": "X34",
"product_price": "500"
}, {
"mobile": "9876543210",
"username": "John",
"bill_id": "AB10",
"bill_date": "2020-12-23",
"product_id": "X35",
"product_price": "230"
}, {
"mobile": "9876543210",
"username": "John",
"bill_id": "AB22",
"bill_date": "2020-11-14",
"product_id": "X89",
"product_price": "700"
}, {
"mobile": "9999999999",
"username": "Alice",
"bill_id": "AC54",
"bill_date": "2019-12-20",
"product_id": "X10",
"product_price": "109"
}]
2. We will then use JoltTransformJSON to make the shift stuff, the specification of such JOLT is listed as:
[ {
"operation" : "shift",
"spec" : {
"*" : {
"*" : "@(1,username)@(1,mobile)@(1,bill_id)@(1,bill_date)@(1,product_id).&"
}
}
}, {
"operation" : "shift",
"spec" : {
"*" : {
"$" : "users.[#2].username",
"*" : {
"$" : "users.[#3].mobile",
"*" : {
"*" : {
"$" : "users.[#5].bills.[#3].bill_date",
"$1" : "users.[#5].bills.[#3].bill_id",
"*" : {
"product_*" : "users.[#6].bills.[#4].products.[#2].&"
}
}
}
}
}
}
} ]
NOTE:
- For the JOLT spec, we gonna group the raw data into organic fields, which means we will get the data which are grouped by "username + mobile + bill_id +bill_date + product_id " fields.
- After that, we get the phase-1 shifted data that are ordered by unique keys.
- For the phase-2 shift, we gonna list out the data as predefined logic which is inferred from your attached sample output.
3. For details on JOLT spec, you may refer to the following link:
https://jolt-demo.appspot.com/
Hope this helps.
Thanks
Oliver Gong
Created on ‎01-16-2022 11:13 PM - edited ‎01-16-2022 11:19 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Attached the corresponding screenshots to cover the detail on implementation.
0. The whole data flow is like:
1.GenerateFlowFile:
2. ConvertRecord:
2.1 CSV Reader(LG CSVReader):
2.2 JsonRecordSetWriter (LG JsonRecordSetWriter):
3. JoltTransformJSON
3.1 JoltTransformJSON - Advanced Mode:
Should you have any questions, just feel free to contact me.
Thanks,
Oliver Gong
Created ‎01-17-2022 05:09 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the solution! It solves my use case perfectly.
Just one question though. I expect the input JSON to the Jolt processor to be somewhat large (larger than the available RAM). In that case, is it absolutely necessary to split the input into multiple files?
Created ‎01-17-2022 06:46 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Glad to hear that works for you!
Though we can split the raw csv or json content into a smaller size just in case of OOM issue when doing the shifting stuff on JoltTransformJSON, it may cause other issue like:
- result from JoltTransformJSON is not complete. That means, some part of the same user data ( with different bill_date) may be wrapped in the other flow files. In such case, we will need to merge them back as a whole.
- It would be better if we can keep raw csv data in some DB table. Then drag out the data rows with a specified limit--> we can use "split pages" way to query on such table,
- We can then easily fetch a logical completed data to do the rest shift things. With such completed result, we don't need to worry about the data is not an info-incomplete one.
