Created 09-17-2017 09:02 AM
I am in a scenario where others define & change the schema on a regular basis and the data pipeline needs to pass through the data. Our developers are using a message envelope with some common fields, and then an array of individual messages. So in my case I might have something like:
{
  "user_id": 123,
  "other_root_field": "blah",
  "parent": {
    "events": [
      {
        "nested_1": "a",
        "nested_2": "b"
      },
      {
        "nested_3": "c",
        "nested_1": "d"
      }
    ]
  }
}
What I want to do is pull out all the individual events, add the data from the envelope, and write them to Kafka (still in JSON format). It seems like I should use the JoltTransformJSON processor, followed by a SplitJson processor, and finally a PublishKafka processor (please correct me if there is a better way).
The first event from the example above would look like:
{"user_id":123,"other_root_field":"blah","exploded_nested_1":"a","exploded_nested_2":"b"}

Note that the fields from the array have an "exploded_" prefix added - this is to avoid name collisions between any fields defined on the envelope and those in the individual events.
To get there it seems like I should produce this from Jolt:
[
  {
    "user_id": 123,
    "other_root_field": "blah",
    "exploded_nested_1": "a",
    "exploded_nested_2": "b"
  },
  {
    "user_id": 123,
    "other_root_field": "blah",
    "exploded_nested_3": "c",
    "exploded_nested_1": "d"
  }
]
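In other words, the transformation I'm after is equivalent to this Python sketch (the function name is just illustrative, not part of my actual pipeline):

```python
def build_record(envelope, event):
    """Merge one event with its envelope: copy every root field except
    "parent", and prefix each event field with "exploded_"."""
    record = {k: v for k, v in envelope.items() if k != "parent"}
    record.update({"exploded_" + k: v for k, v in event.items()})
    return record

envelope = {
    "user_id": 123,
    "other_root_field": "blah",
    "parent": {"events": [{"nested_1": "a", "nested_2": "b"},
                          {"nested_3": "c", "nested_1": "d"}]},
}
records = [build_record(envelope, e) for e in envelope["parent"]["events"]]
# records[0] == {"user_id": 123, "other_root_field": "blah",
#                "exploded_nested_1": "a", "exploded_nested_2": "b"}
```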
I can't seem to quite get there, however:
1. I can't get Jolt to add the prefix to the fields in the array.
[{"operation":"shift","spec":{"parent":{"events":{"*":{"@":"[exploded_&]"}}}}}]

This gives me an error saying that exploded_& is an invalid index for the array. Using just [&] will output the existing field names, though.
2. I can't figure out how to include fields on the root, but exclude the "parent" that holds the array.
[{"operation":"shift","spec":{"parent":{"events":{"*":{"@3":"[&]"}}}}}]

This will get me an array entry for every event, with all the data in each one - I need a way to say "all fields on the root except parent".
Help would be greatly appreciated.
Thanks,
--Ben
Created 09-17-2017 02:41 PM
Hi @Ben Vogan,
You can make use of the Jolt specification below:
[
  {
    "operation": "shift",
    "spec": {
      "parent": {
        "events": {
          "*": {
            "@(3,user_id)": "events[&1].user_id",
            "@(4,other_root_field)": "events[&1].other_root_field",
            "nested_1": "events[&1].exploded_nested_1",
            "nested_2": "events[&1].exploded_nested_2",
            "nested_3": "events[&1].exploded_nested_3"
          }
        }
      }
    }
  }
]

Input:
{
  "user_id": 123,
  "other_root_field": "blah",
  "parent": {
    "events": [
      {
        "nested_1": "a",
        "nested_2": "b"
      },
      {
        "nested_3": "c",
        "nested_1": "d"
      }
    ]
  }
}

Output:
{
  "events": [
    {
      "user_id": 123,
      "other_root_field": "blah",
      "exploded_nested_1": "a",
      "exploded_nested_2": "b"
    },
    {
      "user_id": 123,
      "other_root_field": "blah",
      "exploded_nested_1": "d",
      "exploded_nested_3": "c"
    }
  ]
}

Created 09-17-2017 03:08 PM
@Yash thanks for your reply. However, my problem is that I do not know the set of fields (either on the root or inside the array elements); it is always changing, and I don't want to have to update the spec every time someone adds a field. The spec should only know about parent.events and not assume the existence of any other field. I need a way to say "copy everything at the root, except for the parent field."
What I've done for the moment is implement the logic in Jython, although it is fairly slow.
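For reference, the core of that script (a sketch of the approach, not the exact Jython code) boils down to a loop like this, emitting one JSON document per event, ready to hand to Kafka:

```python
import json

def explode(envelope):
    """Yield one flat record per event: root fields minus "parent",
    plus the event's fields under an "exploded_" prefix."""
    root = {k: v for k, v in envelope.items() if k != "parent"}
    for event in envelope.get("parent", {}).get("events", []):
        record = dict(root)
        record.update(("exploded_" + k, v) for k, v in event.items())
        yield record

message = {"user_id": 123, "other_root_field": "blah",
           "parent": {"events": [{"nested_1": "a", "nested_2": "b"}]}}
# One JSON document per event:
lines = [json.dumps(rec) for rec in explode(message)]
```

Because it only looks up parent.events by name and copies everything else blindly, nothing breaks when someone adds a new field to the envelope or to an event.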