Member since: 11-08-2016 | Posts: 19 | Kudos Received: 3 | Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2942 | 05-23-2018 05:13 PM |
05-23-2018 05:13 PM
Finally figured it out. I needed to set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
11-16-2016 01:26 PM
I ended up modifying SplitJson.java to include the original content, as below: {"RESULT":[{"SPLIT":{ }, "ORIGINAL":{ }}]}
09-17-2017 09:02 AM
I am trying to do something very similar, but I do not know what fields are going to exist in the JSON other than the one that contains the array. I'm using this in a scenario where others define & change the schema on a regular basis and the data pipeline needs to pass the data through. Our developers are using a message envelope with some common fields, and then an array of individual messages. So in my case I might have something like:
{
"user_id": 123,
"other_root_field": "blah",
"parent": {
"events": [
{
"nested_1": "a",
"nested_2": "b"
},
{
"nested_3": "c",
"nested_1": "d"
}
]
}
}
What I want to do is pull out all the individual events, add the data from the envelope to each one, and write them to Kafka (still in JSON format). Looking at the above answer, it seems like I should use the JoltTransformJSON processor, followed by a SplitJson processor and finally a Kafka producer. The first event from the example above would look like:
{
"user_id": 123,
"other_root_field": "blah",
"exploded_nested_1": "a",
"exploded_nested_2": "b"
}
Note that the fields from the array have an "exploded_" prefix added - this is to avoid name collisions between any fields defined on the envelope and those in the individual events. To get there, it seems like I should produce this from Jolt:
[
{
"user_id": 123,
"other_root_field": "blah",
"exploded_nested_1": "a",
"exploded_nested_2": "b"
},
{
"user_id": 123,
"other_root_field": "blah",
"exploded_nested_3": "c",
"exploded_nested_1": "d"
}
]
I can't seem to get there from the answer above, although it seems like I should be able to.
1. I can't get Jolt to add the prefix to the fields in the array.
[{
"operation": "shift",
"spec": {
"parent": {
"events": {
"*": {
"@": "[exploded_&]"
}
}
}
}
}]
This gives me an error that exploded_& is an invalid index for the array. Using just [&] will output the existing field names, though.
2. I can't figure out how to include the fields on the root but exclude the "parent" that holds the array.
[{
"operation": "shift",
"spec": {
"parent": {
"events": {
"*": {
"@3": "[&]"
}
}
}
}
}]
This gets me an array entry for every event, with all of the data in each one. I need a way to say: all fields on the root except "parent".
Help would be greatly appreciated.
Thanks,
--Ben
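To pin down the target output independently of Jolt, here is a minimal Python sketch of the transformation described in this post. It is not a Jolt spec or a NiFi processor. The function name explode_events and its parameters are hypothetical, and it assumes the array always sits under a known path ("parent" then "events") while every other root field is unknown and must pass through unchanged.

```python
import json


def explode_events(message, array_path=("parent", "events"), prefix="exploded_"):
    """Return one flat dict per event: envelope fields plus prefixed event fields.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    # Walk down the known path to reach the array of events.
    node = message
    for key in array_path:
        node = node[key]
    events = node

    # The envelope is every root field except the top-level key holding the array.
    envelope = {k: v for k, v in message.items() if k != array_path[0]}

    # Merge the envelope into each event, prefixing event fields to avoid collisions.
    return [
        {**envelope, **{prefix + k: v for k, v in event.items()}}
        for event in events
    ]


message = {
    "user_id": 123,
    "other_root_field": "blah",
    "parent": {
        "events": [
            {"nested_1": "a", "nested_2": "b"},
            {"nested_3": "c", "nested_1": "d"},
        ]
    },
}

# One JSON document per event, as they would be written to Kafka.
for record in explode_events(message):
    print(json.dumps(record))
```

The sketch never names the envelope fields or the event fields, which is the property the Jolt spec also needs: only the path to the array is fixed.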