NiFi's PutHDFS processor property Conflict Resolution Strategy: Append causing data loss

Expert Contributor

We have a PutHDFS processor with the property Conflict Resolution Strategy set to Append so we can group together events based on a specific hour.

What we found is that the last event of the previously written data is concatenated with the first event in the bin that is being appended. This causes the timestamp to become text within the preceding line, and data loss occurs.

Example:

Apr 19 1:06:59 event data here event data here event data hereApr 19 1:07:00 event data here...

Should be

Apr 19 1:06:59 event data here event data here event data here

Apr 19 1:07:00 event data here...

Has anyone else experienced this problem or found a workaround?

1 ACCEPTED SOLUTION

Master Guru

I wrote this on a different question yesterday, but it relates to the same issue.

Regarding PutHDFS and appending, I believe this is expected behavior. PutHDFS has no idea what it is writing to HDFS; it's just writing bytes, which may or may not represent text. If you were appending parts of an image or video, there would be no such thing as new lines.

If you want a new line when you start appending, then you need the previously written data to end with a new line, or the next data to start with a new line. This should be easily done by manipulating the data in the flow before PutHDFS.
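To illustrate the point about raw bytes, here is a minimal Python sketch (not NiFi code). The helper names `append_bytes` and `ensure_trailing_newline` are hypothetical and only model the idea: an append adds no separator, so the data itself has to end with a newline.

```python
def append_bytes(existing: bytes, incoming: bytes) -> bytes:
    """Model a raw append: the bytes are joined with no separator."""
    return existing + incoming


def ensure_trailing_newline(data: bytes) -> bytes:
    """Return data unchanged if it is empty or already ends with a newline."""
    if not data or data.endswith(b"\n"):
        return data
    return data + b"\n"


first = b"Apr 19 1:06:59 event data here"
second = b"Apr 19 1:07:00 event data here"

# Without a trailing newline the two events fuse into one line.
fused = append_bytes(first, second)

# Terminating each chunk before the append keeps the events separate.
fixed = append_bytes(ensure_trailing_newline(first),
                     ensure_trailing_newline(second))
```

In a flow, the equivalent of `ensure_trailing_newline` would happen in a processor placed before PutHDFS.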


7 REPLIES


You could use a ReplaceText processor to append a '\n' (line break) to each event before you route it to the PutHDFS processor.

Expert Contributor

Thank you for this advice, but it requires me to use regex. The issue is that the end character(s) of incoming events will differ across the kinds of sources in our setup. I guess I can search for the start of the timestamp and put the "\n" before it, which I will try.

It's unfortunate, and it seems like a gross oversight if the append feature works this way, combining the previous event and the appended event together. I was hoping there was a solution integrated in the technology for this instead of a workaround.
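The timestamp idea can be sketched in Python as follows. This is only an illustration: the pattern assumes timestamps shaped like "Apr 19 1:06:59" as in the example above, the name `split_fused_events` is hypothetical, and a regex used in ReplaceText would need to be checked against NiFi's own regex syntax. The lookbehind skips timestamps that already start a line, so no blank lines are introduced.

```python
import re

# Zero-width match: a position preceded by a non-newline character and
# followed by a timestamp such as "Apr 19 1:06:59" (assumed format).
FUSED_TIMESTAMP = re.compile(
    r"(?<=[^\n])(?=[A-Z][a-z]{2} \d{1,2} \d{1,2}:\d{2}:\d{2})"
)


def split_fused_events(text: str) -> str:
    """Insert a newline before any timestamp that does not start a line."""
    return FUSED_TIMESTAMP.sub("\n", text)


fused = "Apr 19 1:06:59 event data hereApr 19 1:07:00 event data here"
repaired = split_fused_events(fused)
```

A drawback of this approach is that any event text that happens to contain a timestamp-shaped string would also be split.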

Expert Contributor

I was thinking that too, and Hellmar's answer gave me a clue as to how to do it. However, using ReplaceText to add a newline doesn't let me specify "add a new line right after a specific bin of events" or "add a new line right before the first line in this bin of events". Instead, it lets me use regex to find keywords in the data. (I am putting a newline before the timestamp, which works but also adds an extra line after every event.)

Is there a way to specify "put a newline at the end of this bin of events before the append happens"?

Master Guru

Using ReplaceText with the Replacement Strategy set to Prepend and the Evaluation Mode set to Entire Text will put the Replacement Value at the beginning of the content. The same can be done with a Replacement Strategy of Append to place the Replacement Value at the end.

Alternatively, if you are using MergeContent (I can't remember whether you are), you can set the Delimiter Strategy to Text and use the Header or Footer property to add a new line. Pressing shift+enter in the property value for the Header or Footer creates a new line.
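The effect of a newline Footer can be modeled with a short Python sketch. This is not NiFi code: `merge_bin` is a hypothetical stand-in for a text-delimited merge, and it assumes events inside a bin are already separated by a newline demarcator while the bin itself has no trailing newline.

```python
def merge_bin(events, header=b"", footer=b"", demarcator=b"\n"):
    """Concatenate one bin: header, events joined by demarcator, footer."""
    return header + demarcator.join(events) + footer


bins = (
    [b"Apr 19 1:06:58 event data here", b"Apr 19 1:06:59 event data here"],
    [b"Apr 19 1:07:00 event data here"],
)

# Each merged bin is appended to the same hourly file. The newline footer
# terminates the last event of a bin, so the next append starts a new line.
hour_file = b""
for bin_events in bins:
    hour_file += merge_bin(bin_events, footer=b"\n")

lines = hour_file.decode().splitlines()
```

Because the footer is applied once per bin rather than once per event, no regex is needed and no extra blank lines appear between events.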

Expert Contributor

Awesome! Yes! This is what I was looking for, I think.

Expert Contributor

BTW I ended up using the Footer property in MergeContent and it worked wonderfully with no regex involved.