How can I merge different FlowFiles together?

Contributor

Hi everyone,

Let me explain: I have a Kafka topic which contains many events that all share the same JSON schema.

Example:

{"timestamp_da":"2016-11-08T02:07:15.208+01:00","PROGRAM":"AAA","PRIORITY":"notice"}
{"timestamp_da":"2016-11-08T02:07:15.208+01:00","PROGRAM":"BBB","PRIORITY":"notice"}
{"timestamp_da":"2016-11-08T02:07:15.208+01:00","PROGRAM":"BBB","PRIORITY":"notice"}
{"timestamp_da":"2016-11-08T02:07:15.208+01:00","PROGRAM":"AAA","PRIORITY":"notice"}

The thing is, I want to merge all the events that contain "PROGRAM":"AAA" into a single file before storing everything in HDFS.

I have tried MergeContent, but all I get is the events grouped into many files. What I want is to group them into a single big file, or at least into files that match the 128 MB block size.

I don't know if I made myself clear; if you need more information, just ask.

Thanks for your help

2 REPLIES
Re: How can I merge different FlowFiles together?

Master Guru

@Toky Raobelina

You could try using SplitJson --> ExtractText --> MergeContent --> PutHDFS.

Use SplitJson to split your JSON into individual FlowFiles.

Use ExtractText to create a new FlowFile attribute that contains the "PROGRAM" value (e.g. AAA or BBB).

Use MergeContent with its "Correlation Attribute Name" property set to the name of the attribute created by ExtractText.

Then write the merged files to HDFS.
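To make the idea concrete, here is a minimal Python sketch of the grouping logic described above. It is not NiFi code and not how MergeContent is implemented; it only illustrates binning events by a correlation key (here the "PROGRAM" field) and releasing a merged bundle once a bin reaches a minimum size. The function names and the `min_bin_bytes` parameter are made up for this illustration.

```python
import json
from collections import defaultdict


def parse_concatenated_json(text):
    """Yield each JSON object from a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def merge_by_program(events, min_bin_bytes):
    """Bin events by PROGRAM; release a bin once it holds min_bin_bytes.

    Returns (merged, pending): merged is a list of (program, text) bundles
    with one JSON event per line, pending maps each program to the lines
    still waiting in an unfinished bin.
    """
    bins = defaultdict(list)
    sizes = defaultdict(int)
    merged = []
    for event in events:
        key = event.get("PROGRAM")
        if not key:
            continue  # skip events with a missing or empty PROGRAM
        line = json.dumps(event)
        bins[key].append(line)
        sizes[key] += len(line.encode("utf-8")) + 1  # +1 for the newline
        if sizes[key] >= min_bin_bytes:
            merged.append((key, "\n".join(bins.pop(key)) + "\n"))
            sizes.pop(key)
    return merged, dict(bins)
```

With a threshold around the HDFS block size, each program would accumulate many events before anything is written out; whatever is left in `pending` corresponds to bins that have not filled up yet.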

Re: How can I merge different FlowFiles together?

Contributor

Hi mclark, thanks for your answer.

The thing is, I have already used the SplitJson processor, but it creates too many FlowFiles, which is not good for performance. You can see what I have done so far in the image below.

9232-nifi-template.png

EvaluateJsonPath: this is where I have built the JSON schema.

RouteOnAttribute: sometimes PROGRAM is empty, so I keep only the events where it is not.

PutHDFS: in the Directory property I have: /nifi/program/${PROGRAM}/${timestamp_da:toDate("yyyy-MM-dd'T'HH:mm:ss.SSSXXX"):format('yyyyMMdd')}

==> It works fine and I get the performance I want, but I end up with too many small files: each event becomes one file of 255 bytes.
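For reference, the Directory expression used in PutHDFS above can be mimicked in a few lines of Python. This is only a sketch of what the path should look like for one event. The `hdfs_directory` helper is made up for the illustration, and it formats the date in the event's own UTC offset, which may differ from the time zone NiFi uses when formatting.

```python
from datetime import datetime


def hdfs_directory(attributes):
    """Build /nifi/program/<PROGRAM>/<yyyyMMdd> from an event's attributes."""
    # fromisoformat accepts the ISO 8601 timestamp with its +01:00 offset.
    ts = datetime.fromisoformat(attributes["timestamp_da"])
    # The date is rendered in the timestamp's own offset (an assumption).
    return "/nifi/program/{}/{}".format(attributes["PROGRAM"], ts.strftime("%Y%m%d"))


event = {"timestamp_da": "2016-11-08T02:07:15.208+01:00", "PROGRAM": "AAA"}
print(hdfs_directory(event))
```

Since every event from the same program and day lands in the same directory, the number of files written there depends only on how many FlowFiles reach PutHDFS.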
