
Nifi JoltTransformJSON to remove duplicate Json records from Json array - Jolt

Rising Star

I am trying to remove duplicate JSON records from a JSON array using a Jolt transformation.

Here is an example I tried.

Input:

[
  { "id": 1, "name": "jeorge", "age": 25 },
  { "id": 2, "name": "manhan", "age": 25 },
  { "id": 1, "name": "george", "age": 225 }
]

Jolt script:

[
  { "operation": "shift", "spec": { "*": { "id": "[&1].id" } } }
]

Output: [ { "id": 1 }, { "id": 2 }, { "id": 1 } ]

I am getting only the selected fields, which is what I want, but I would also like to remove the duplicates. Desired output: [ { "id": 1 }, { "id": 2 } ]

Please provide a script that achieves this. Thanks in advance.


3 REPLIES

Expert Contributor

Hi @srini,

Try the following with the Chain specification type selected in the processor:

[
  { "operation": "shift", "spec": { "*": { "id": "[&1].id" } } },
  { "operation": "shift", "spec": { "*": "@id[]" } },
  { "operation": "cardinality", "spec": { "*": { "@": "ONE" } } },
  { "operation": "shift", "spec": { "*": { "id": "[].id" } } }
]

What I did was add a bucketing shift operation that groups identical JSON entries into "buckets", then used cardinality to keep just one entry per bucket, and finally another shift to produce the output you need. Please let me know if this works for you.
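To make the chain easier to follow, here is roughly what the document should look like after each operation, traced by hand from the spec against the example input (untested, so treat the intermediate shapes as approximate):

```json
// After shift 1 (keep only "id"):
[ { "id": 1 }, { "id": 2 }, { "id": 1 } ]

// After shift 2 (bucket each entry under its id value via "@id[]"):
{ "1": [ { "id": 1 }, { "id": 1 } ], "2": [ { "id": 2 } ] }

// After cardinality ONE (collapse each bucket's list to a single entry):
{ "1": { "id": 1 }, "2": { "id": 2 } }

// After shift 3 (flatten the buckets back into an array):
[ { "id": 1 }, { "id": 2 } ]
```

The middle two operations are what actually deduplicate: identical values collide on the same bucket key, and cardinality discards all but one.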

Rising Star

First of all, thank you @Yolanda M. Davis for the quick response.

Correct me if I am wrong, but the above solution only eliminates duplicate JSON records based on a single field. We also have scenarios that require eliminating duplicates based on multiple fields; in the example below, those fields are domain, location, time, function, and unit. Please provide a Jolt script that handles this. Thanks.

Or, to put it simply: eliminate duplicate JSON objects from an array of JSON.

Input:

[
  { "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "AOI_S1", "unit": "AOI_L31" },
  { "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" },
  { "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.google.com", "location": "texas", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.hortonworks.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" }
]

Desired output:

[
  { "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.google.com", "location": "texas", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },
  { "domain": "www.hortonworks.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" }
]

Expert Contributor

@srini one thing you could try: instead of bucketing on that one attribute, create another attribute that denormalizes all of the attributes into one string, and bucket on that column (while leaving the other attributes in place). Duplicate records would then land under the same key in the subsequent operation, the remaining operations would keep a single entry per key, and a final step would remove the denormalized column. It's a bit of a dance, but I think it could work. Does that make sense?
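A sketch of that idea as a Jolt chain (untested, traced by hand, and assuming none of the field values contain dots, since dots would split the output path): instead of building a literal concatenated string, chain @-references in the shift output path so each record lands in a nested bucket keyed by domain, location, time, function, and unit. Cardinality then keeps one record per bucket, and a final shift flattens the buckets back into an array:

```json
[
  {
    "operation": "shift",
    "spec": {
      "*": "@domain.@location.@time.@function.@unit[]"
    }
  },
  {
    "operation": "cardinality",
    "spec": {
      "*": { "*": { "*": { "*": { "*": { "@": "ONE" } } } } }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "*": { "*": { "*": { "*": { "*": "[]" } } } }
    }
  }
]
```

Once the modify-overwrite-beta feature is available, a string concat into an explicit composite-key field would be a more robust way to build the bucket key than dotted @-references.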

One thing I think would be good to get on the radar is upgrading Jolt in NiFi, perhaps once the modify feature graduates from beta. I think that would help simplify some of the hoops needed for this type of work.