Nifi JoltTransformJSON to remove duplicate Json records from Json array - Jolt

I am trying to remove duplicate JSON records from a JSON array using a Jolt transformation.

Here is an example I tried.

Input:

[ { "id": 1, "name": "jeorge", "age": 25 }, { "id": 2, "name": "manhan", "age": 25 }, { "id": 1, "name": "george", "age": 225 } ]

Jolt script:

[ { "operation": "shift", "spec": { "*": { "id": "[&1].id" } } } ]

Output:

[ { "id" : 1 }, { "id" : 2 }, { "id" : 1 } ]

This returns only the selected field. Along with that, I would like to remove the duplicates.

Desired output:

[ { "id" : 1 }, { "id" : 2 } ]

Please provide a script that does this. Thanks in advance.

1 ACCEPTED SOLUTION

Re: Nifi JoltTransformJSON to remove duplicate Json records from Json array - Jolt

Rising Star

Hi @srini,

Try the following with the Chain specification type selected in the processor:

[
  { "operation": "shift", "spec": { "*": { "id": "[&1].id" } } },
  { "operation": "shift", "spec": { "*": "@id[]" } },
  { "operation": "cardinality", "spec": { "*": { "@": "ONE" } } },
  { "operation": "shift", "spec": { "*": { "id": "[].id" } } }
]

What I did was add a "bucket" shift operation, which groups identical JSON entries into buckets keyed by their id value, then used cardinality to keep just one entry per bucket, and finally another shift to produce the output you need. Please let me know if this works for you.
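For readers who find the chained spec hard to follow, here is a rough plain-Python sketch of the same three stages (bucket, keep one, flatten). It is only an illustration of the logic, not Jolt itself and not how the processor is implemented; the variable names are made up for this example.

```python
# Plain-Python illustration of the bucket / cardinality / flatten stages.
# This is not Jolt; all names here are hypothetical.
records = [{"id": 1}, {"id": 2}, {"id": 1}]  # result of the first shift

# Stage 1 (bucket shift): group entries under their id value.
buckets = {}
for rec in records:
    buckets.setdefault(str(rec["id"]), []).append(rec)

# Stage 2 (cardinality ONE): keep only the first entry of each bucket.
one_each = {key: entries[0] for key, entries in buckets.items()}

# Stage 3 (final shift): flatten the buckets back into an array.
deduped = [{"id": rec["id"]} for rec in one_each.values()]
print(deduped)  # [{'id': 1}, {'id': 2}]
```

The intermediate `buckets` value plays the role of the object produced by the second shift, with one key per distinct id.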

3 REPLIES 3

Re: Nifi JoltTransformJSON to remove duplicate Json records from Json array - Jolt

First of all, thank you @Yolanda M. Davis for the quick response.

Correct me if I am wrong, but the above solution eliminates duplicate JSON records based on only one field. We have scenarios that require eliminating duplicates based on multiple fields; in the example below these are domain, location, time, function and unit. Please provide a Jolt script for this. Thanks.

Or, put simply: eliminate duplicate JSON records from an array of JSON.

Input :

[{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "AOI_S1", "unit": "AOI_L31" },

{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" },

{ "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.google.com", "location": "texas", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.hortonworks.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" } ]

Desired output :

[{ "domain": "www.google.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.yahoo.com", "location": "newyork", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.google.com", "location": "texas", "time": "CDT UTC-0500", "function": "PACK", "unit": "PACK_ESR" },

{ "domain": "www.hortonworks.com", "location": "newyork", "time": "CDT UTC-0500", "function": "ALIGN", "unit": "ALIGN2" } ]

Re: Nifi JoltTransformJSON to remove duplicate Json records from Json array - Jolt

Rising Star

@srini one thing you could try is, instead of using that one attribute to bucket on, create another attribute that denormalizes all of the attributes into one string, and use that attribute to bucket on (still leaving the other attributes in place). Duplicate records would then produce the same string and end up bucketed together under it in the subsequent operation. The rest of the operations would pick one record per bucket and then remove the denormalized attribute. It's a bit of a dance, but I think it could work. Does that make sense?
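To make the idea concrete, here is a minimal Python sketch of bucketing on a denormalized key. It is not a Jolt spec and has not been run in NiFi; the function name `dedupe_by_fields`, the `|` separator and the three-record sample (a subset of the input above) are assumptions made for illustration.

```python
def dedupe_by_fields(records, fields, sep="|"):
    """Keep the first record for each distinct combination of `fields`."""
    buckets = {}
    for rec in records:
        # Denormalize the chosen attributes into one string to bucket on.
        key = sep.join(str(rec[f]) for f in fields)
        # setdefault keeps the first record seen per key (like cardinality ONE).
        buckets.setdefault(key, rec)
    return list(buckets.values())


fields = ["domain", "location", "time", "function", "unit"]
common = {"location": "newyork", "time": "CDT UTC-0500",
          "function": "PACK", "unit": "PACK_ESR"}
records = [
    {"domain": "www.google.com", **common},
    {"domain": "www.yahoo.com", **common},
    {"domain": "www.yahoo.com", **common},  # duplicate of the previous record
]

unique = dedupe_by_fields(records, fields)
print([r["domain"] for r in unique])  # ['www.google.com', 'www.yahoo.com']
```

In this sketch the key lives only in the dictionary, so there is nothing to strip afterwards; in a Jolt chain the helper attribute would be written into each record and would need a final removal step. The separator should be a character that cannot occur in the field values, otherwise two different records could produce the same key.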

One thing I think would be good to get on the radar is upgrading Jolt in NiFi, perhaps once the modify feature graduates from beta. I think that would simplify some of the hoops needed for this type of work.
