Support Questions

Tomson · ‎01-16-2022

I have the following CSV:

mobile	username	bill_id	bill_date	product_id	product_price
9876543210	John	AB10	2020-12-23	X34	500
9876543210	John	AB10	2020-12-23	X35	230
9876543210	John	AB22	2020-11-14	X89	700
9999999999	Alice	AC54	2019-12-20	X10	109

And I want to convert them to the following structure:

{
    "users": [
    {
        "mobile": "9876543210",
        "username": "John",
        "bills": [
        {
            "bill_id": "AB10",
            "bill_date": "2020-12-23",
            "products": [{
                "product_id": "X34",
                "product_price": "500",
            },
            {
                "product_id": "X35",
                "product_price": "230",
            }]
        },
        {
            "bill_id": "AB22",
            "bill_date": "2020-11-14",
            "products": [
            {
                "product_id": "X89",
                "product_price": "700",
            }]
        }]
    },
    {
        "mobile": "9999999999",
        "username": "Alice",
        "bills": [
        {
            "bill_id": "AC54",
            "bill_date": "2019-12-20",
            "products": [
            {
                "product_id": "X10",
                "product_price": "109",
            }]
        }]
    }]
}

I wasn't able to figure out how to group the related properties (e.g. product properties with the same bill_id) into an array. How should I go about doing this?

OliverGong · ‎01-16-2022

Thank you for your question.
We can try following steps to make it work as expected.
1. Convert csv to json ( you may use ConvertRecord ), to get JSON as below

[{
	"mobile": "9876543210",
	"username": "John",
	"bill_id": "AB10",
	"bill_date": "2020-12-23",
	"product_id": "X34",
	"product_price": "500"
}, {
	"mobile": "9876543210",
	"username": "John",
	"bill_id": "AB10",
	"bill_date": "2020-12-23",
	"product_id": "X35",
	"product_price": "230"
}, {
	"mobile": "9876543210",
	"username": "John",
	"bill_id": "AB22",
	"bill_date": "2020-11-14",
	"product_id": "X89",
	"product_price": "700"
}, {
	"mobile": "9999999999",
	"username": "Alice",
	"bill_id": "AC54",
	"bill_date": "2019-12-20",
	"product_id": "X10",
	"product_price": "109"
}]

2. We will then use JoltTransformJSON to make the shift stuff, the specification of such JOLT is listed as:

[ {
  "operation" : "shift",
  "spec" : {
    "*" : {
      "*" : "@(1,username)@(1,mobile)@(1,bill_id)@(1,bill_date)@(1,product_id).&"
    }
  }
}, {
  "operation" : "shift",
  "spec" : {
    "*" : {
      "$" : "users.[#2].username",
      "*" : {
        "$" : "users.[#3].mobile",
        "*" : {
          "*" : {
            "$" : "users.[#5].bills.[#3].bill_date",
            "$1" : "users.[#5].bills.[#3].bill_id",
            "*" : {
              "product_*" : "users.[#6].bills.[#4].products.[#2].&"
            }
          }
        }
      }
    }
  }
} ]

NOTE:
- For the JOLT spec, we gonna group the raw data into organic fields, which means we will get the data which are grouped by "username + mobile + bill_id +bill_date + product_id " fields.
- After that, we get the phase-1 shifted data that are ordered by unique keys.
- For the phase-2 shift, we gonna list out the data as predefined logic which is inferred from your attached sample output.

3. For details on JOLT spec, you may refer to the following link:
https://jolt-demo.appspot.com/

Hope this helps.
Thanks
Oliver Gong

View solution in original post

OliverGong · ‎01-16-2022

Thank you for your question.
We can try following steps to make it work as expected.
1. Convert csv to json ( you may use ConvertRecord ), to get JSON as below

[{
	"mobile": "9876543210",
	"username": "John",
	"bill_id": "AB10",
	"bill_date": "2020-12-23",
	"product_id": "X34",
	"product_price": "500"
}, {
	"mobile": "9876543210",
	"username": "John",
	"bill_id": "AB10",
	"bill_date": "2020-12-23",
	"product_id": "X35",
	"product_price": "230"
}, {
	"mobile": "9876543210",
	"username": "John",
	"bill_id": "AB22",
	"bill_date": "2020-11-14",
	"product_id": "X89",
	"product_price": "700"
}, {
	"mobile": "9999999999",
	"username": "Alice",
	"bill_id": "AC54",
	"bill_date": "2019-12-20",
	"product_id": "X10",
	"product_price": "109"
}]

2. We will then use JoltTransformJSON to make the shift stuff, the specification of such JOLT is listed as:

[ {
  "operation" : "shift",
  "spec" : {
    "*" : {
      "*" : "@(1,username)@(1,mobile)@(1,bill_id)@(1,bill_date)@(1,product_id).&"
    }
  }
}, {
  "operation" : "shift",
  "spec" : {
    "*" : {
      "$" : "users.[#2].username",
      "*" : {
        "$" : "users.[#3].mobile",
        "*" : {
          "*" : {
            "$" : "users.[#5].bills.[#3].bill_date",
            "$1" : "users.[#5].bills.[#3].bill_id",
            "*" : {
              "product_*" : "users.[#6].bills.[#4].products.[#2].&"
            }
          }
        }
      }
    }
  }
} ]

NOTE:
- For the JOLT spec, we gonna group the raw data into organic fields, which means we will get the data which are grouped by "username + mobile + bill_id +bill_date + product_id " fields.
- After that, we get the phase-1 shifted data that are ordered by unique keys.
- For the phase-2 shift, we gonna list out the data as predefined logic which is inferred from your attached sample output.

3. For details on JOLT spec, you may refer to the following link:
https://jolt-demo.appspot.com/

Hope this helps.
Thanks
Oliver Gong

OliverGong · ‎01-16-2022

Attached the corresponding screenshots to cover the detail on implementation.
0. The whole data flow is like:

1.GenerateFlowFile:

2. ConvertRecord:

2.1 CSV Reader(LG CSVReader):

2.2 JsonRecordSetWriter (LG JsonRecordSetWriter):

3. JoltTransformJSON

3.1 JoltTransformJSON - Advanced Mode:

Should you have any questions, just feel free to contact me.
Thanks,
Oliver Gong

Tomson · ‎01-17-2022

Thanks for the solution! It solves my use case perfectly.

Just one question though. I expect the input JSON to the Jolt processor to be somewhat large (larger than the available RAM). In that case, is it absolutely necessary to split the input into multiple files?

OliverGong · ‎01-17-2022

Glad to hear that works for you!
Though we can split the raw csv or json content into a smaller size just in case of OOM issue when doing the shifting stuff on JoltTransformJSON, it may cause other issue like:
- result from JoltTransformJSON is not complete. That means, some part of the same user data ( with different bill_date) may be wrapped in the other flow files. In such case, we will need to merge them back as a whole.

- It would be better if we can keep raw csv data in some DB table. Then drag out the data rows with a specified limit--> we can use "split pages" way to query on such table,
- We can then easily fetch a logical completed data to do the rest shift things. With such completed result, we don't need to worry about the data is not an info-incomplete one.

Cloudera Community

Support Questions

How to convert multiple related rows in CSV into nested JSON using Nifi?