Created on 01-16-2022 10:52 AM - edited 01-16-2022 11:01 AM
I have the following CSV:
mobile | username | bill_id | bill_date | product_id | product_price |
9876543210 | John | AB10 | 2020-12-23 | X34 | 500 |
9876543210 | John | AB10 | 2020-12-23 | X35 | 230 |
9876543210 | John | AB22 | 2020-11-14 | X89 | 700 |
9999999999 | Alice | AC54 | 2019-12-20 | X10 | 109 |
And I want to convert it to the following structure:
{
  "users": [
    {
      "mobile": "9876543210",
      "username": "John",
      "bills": [
        {
          "bill_id": "AB10",
          "bill_date": "2020-12-23",
          "products": [
            {
              "product_id": "X34",
              "product_price": "500"
            },
            {
              "product_id": "X35",
              "product_price": "230"
            }
          ]
        },
        {
          "bill_id": "AB22",
          "bill_date": "2020-11-14",
          "products": [
            {
              "product_id": "X89",
              "product_price": "700"
            }
          ]
        }
      ]
    },
    {
      "mobile": "9999999999",
      "username": "Alice",
      "bills": [
        {
          "bill_id": "AC54",
          "bill_date": "2019-12-20",
          "products": [
            {
              "product_id": "X10",
              "product_price": "109"
            }
          ]
        }
      ]
    }
  ]
}
I wasn't able to figure out how to group the related properties (e.g. product properties with the same bill_id) into an array. How should I go about doing this?
Created on 01-16-2022 09:48 PM - edited 01-16-2022 09:50 PM
Thank you for your question.
We can try the following steps to make it work as expected.
1. Convert the CSV to JSON (you may use ConvertRecord) to get JSON like the one below:
[
  {
    "mobile": "9876543210",
    "username": "John",
    "bill_id": "AB10",
    "bill_date": "2020-12-23",
    "product_id": "X34",
    "product_price": "500"
  },
  {
    "mobile": "9876543210",
    "username": "John",
    "bill_id": "AB10",
    "bill_date": "2020-12-23",
    "product_id": "X35",
    "product_price": "230"
  },
  {
    "mobile": "9876543210",
    "username": "John",
    "bill_id": "AB22",
    "bill_date": "2020-11-14",
    "product_id": "X89",
    "product_price": "700"
  },
  {
    "mobile": "9999999999",
    "username": "Alice",
    "bill_id": "AC54",
    "bill_date": "2019-12-20",
    "product_id": "X10",
    "product_price": "109"
  }
]
2. We will then use JoltTransformJSON to perform the shift. The JOLT specification is:
[
  {
    "operation" : "shift",
    "spec" : {
      "*" : {
        "*" : "@(1,username)@(1,mobile)@(1,bill_id)@(1,bill_date)@(1,product_id).&"
      }
    }
  },
  {
    "operation" : "shift",
    "spec" : {
      "*" : {
        "$" : "users.[#2].username",
        "*" : {
          "$" : "users.[#3].mobile",
          "*" : {
            "*" : {
              "$" : "users.[#5].bills.[#3].bill_date",
              "$1" : "users.[#5].bills.[#3].bill_id",
              "*" : {
                "product_*" : "users.[#6].bills.[#4].products.[#2].&"
              }
            }
          }
        }
      }
    }
  }
]
NOTE:
- The first shift groups the raw records into a nested structure, i.e. the data ends up grouped by the "username + mobile + bill_id + bill_date + product_id" keys.
- After that phase-1 shift, the records are organized under those unique keys (a sketch of this intermediate structure is shown after step 3 below).
- The phase-2 shift then lays the data out according to the structure inferred from your attached sample output.
3. For details on the JOLT spec, you may refer to the following link:
https://jolt-demo.appspot.com/
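For reference, if you run only the first shift of the spec above (for example in the jolt-demo page linked in step 3), the intermediate output should look roughly like the sketch below, with every record filed under its username / mobile / bill_id / bill_date / product_id keys. Only John's AB10 bill is shown here; the other records follow the same pattern:
{
  "John": {
    "9876543210": {
      "AB10": {
        "2020-12-23": {
          "X34": {
            "mobile": "9876543210",
            "username": "John",
            "bill_id": "AB10",
            "bill_date": "2020-12-23",
            "product_id": "X34",
            "product_price": "500"
          },
          "X35": {
            "mobile": "9876543210",
            "username": "John",
            "bill_id": "AB10",
            "bill_date": "2020-12-23",
            "product_id": "X35",
            "product_price": "230"
          }
        }
      }
    }
  }
}
The second shift then walks back down these key levels, copying each key out as username, mobile, bill_id and bill_date, and collecting the product_* fields into the products arrays.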
Hope this helps.
Thanks
Oliver Gong
Created on 01-16-2022 11:13 PM - edited 01-16-2022 11:19 PM
I have attached the corresponding screenshots to cover the implementation details; a rough sketch of the reader/writer settings is also given after the list.
0. The whole data flow
1. GenerateFlowFile
2. ConvertRecord
2.1 CSV Reader (LG CSVReader)
2.2 JsonRecordSetWriter (LG JsonRecordSetWriter)
3. JoltTransformJSON
3.1 JoltTransformJSON - Advanced Mode
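In case the screenshots do not come through, here is a rough idea of the reader/writer configuration. These are standard NiFi CSVReader and JsonRecordSetWriter properties, but the concrete values below are my assumptions based on the pipe-delimited sample; the actual values are whatever the screenshots show, and property names can differ slightly between NiFi versions:
CSVReader (LG CSVReader)
  Schema Access Strategy     = Infer Schema
  Treat First Line as Header = true
  Value Separator            = |
  Trim Fields                = true
JsonRecordSetWriter (LG JsonRecordSetWriter)
  Schema Write Strategy = Do Not Write Schema
  Output Grouping       = Array Of Objects
  Pretty Print JSON     = true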
Should you have any questions, please feel free to contact me.
Thanks,
Oliver Gong
Created 01-17-2022 05:09 AM
Thanks for the solution! It solves my use case perfectly.
Just one question though. I expect the input JSON to the Jolt processor to be somewhat large (larger than the available RAM). In that case, is it absolutely necessary to split the input into multiple files?
Created 01-17-2022 06:46 AM
Glad to hear that works for you!
Although we can split the raw CSV or JSON content into smaller pieces to avoid OOM issues when JoltTransformJSON does the shift, splitting can cause other problems:
- The result from JoltTransformJSON would not be complete: parts of the same user's data (bills with different bill_date values, for example) may end up in other flow files, and we would then need to merge them back into a whole.
- It would be better to keep the raw CSV data in a DB table and pull the rows out with a specified limit, i.e. query that table in a "split pages" fashion (a rough sketch of such a query is given below).
- That way we can fetch a logically complete chunk of data for the remaining shift work, and we don't need to worry about the result being incomplete.
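Purely as an illustration of the "split pages" idea above (this sketch is not from the original reply): assuming the raw CSV rows were loaded into a hypothetical table named raw_bills, a paged query that keeps each user's rows together, runnable from NiFi with a processor such as ExecuteSQL, could look roughly like this (exact pagination syntax depends on your database):
-- page through users (not raw rows) so every chunk contains complete user data;
-- table and column names are assumptions taken from the sample CSV
SELECT mobile, username, bill_id, bill_date, product_id, product_price
FROM raw_bills
WHERE mobile IN (
    SELECT mobile
    FROM (
        SELECT DISTINCT mobile
        FROM raw_bills
        ORDER BY mobile
        LIMIT 100 OFFSET 0   -- advance the offset for each page
    ) AS user_page
)
ORDER BY mobile, bill_id, bill_date, product_id;
Paging over distinct users rather than raw rows guarantees that every chunk handed to JoltTransformJSON already contains all of a given user's bills.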