Created 04-19-2017 09:02 PM
We have a PutHDFS processor with the property Conflict Resolution Strategy set to append so we can group together events based on a specific hour.
What we found is that the last event of the previously written data gets joined to the first event of the bin being appended, with no line break between them. The timestamp of the new event becomes plain text in the middle of the previous line, and data loss occurs.
Example:
Apr 19 1:06:59 event data here event data here event data hereApr 19 1:07:00 event data here...
Should be
Apr 19 1:06:59 event data here event data here event data here
Apr 19 1:07:00 event data here...
Has anyone else experienced this problem, or found a workaround?
Created 04-20-2017 06:56 AM
You could use a ReplaceText processor to append a '\n' (line break) to each event before you route it to the PutHDFS processor.
Created 04-20-2017 01:19 PM
Thank you for this advice, but it requires me to use regex. The issue is that the end character(s) of incoming events will differ depending on the event type in our setup. I guess I can search for the start of the timestamp instead and put the "\n" before it. I will try that.
It's unfortunate, and seems like a gross oversight, if the append feature works this way, joining the previous event and the appended event together. I was hoping for a solution built into the technology instead of a workaround.
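For anyone who wants to try the timestamp idea from this post, here is a rough sketch in plain Python (not NiFi code). The month-name pattern and the function name `split_joined_events` are my own assumptions based on the sample lines in the question. The lookbehind only matches where the timestamp is not already at the start of a line, so existing line breaks are left alone. Java regular expressions, which ReplaceText uses as far as I know, support the same lookaround syntax, but treat that as something to verify in your own flow.

```python
import re

# Assumed timestamp shape, taken from the sample "Apr 19 1:06:59".
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
TIMESTAMP = rf"(?:{MONTHS}) +\d{{1,2}} +\d{{1,2}}:\d{{2}}:\d{{2}}"

# Zero-width match: preceded by any character except a newline,
# and followed by a timestamp. Start of text never matches.
BOUNDARY = re.compile(rf"(?<=[^\n])(?={TIMESTAMP})")


def split_joined_events(text: str) -> str:
    """Insert a newline before every timestamp that is stuck to the previous event."""
    return BOUNDARY.sub("\n", text)


joined = (
    "Apr 19 1:06:59 event data here event data here"
    "Apr 19 1:07:00 event data here"
)
print(split_joined_events(joined))
```

Note that this would also split a line if the event text itself happens to contain something shaped like a timestamp, so it is only safe when the pattern cannot occur inside an event.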
Created 04-20-2017 01:38 PM
I wrote this on a different question yesterday, but it relates to the same issue.
Regarding PutHDFS and appending, I believe this is expected behavior. PutHDFS has no idea what it is writing to HDFS; it's just writing bytes, which may or may not represent text. If you were appending parts of an image or video, there would be no such thing as new lines.
If you want a new line when you start appending, then you need the previously written data to end with a new line, or the next data to start with a new line. This should be easily done by manipulating the data in the flow before PutHDFS.
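To illustrate the point, here is a small sketch in plain Python that imitates a byte-level append. It is not NiFi or HDFS code, and the function names `ensure_trailing_newline` and `append_chunks` are made up for this example. It shows that appending raw bytes runs the events together, and that making each chunk end with a newline before the append keeps one event per line.

```python
def ensure_trailing_newline(chunk: bytes) -> bytes:
    """Return the chunk unchanged if it already ends with a newline, else add one."""
    if chunk and not chunk.endswith(b"\n"):
        return chunk + b"\n"
    return chunk


def append_chunks(chunks, fix: bool = False) -> bytes:
    """Imitate a byte-level append: each chunk is written right after the previous one."""
    target = bytearray()
    for chunk in chunks:
        target += ensure_trailing_newline(chunk) if fix else chunk
    return bytes(target)


first = b"Apr 19 1:06:59 event data here"
second = b"Apr 19 1:07:00 event data here"

print(append_chunks([first, second]))            # events run together
print(append_chunks([first, second], fix=True))  # one event per line
```

Adding the newline only when it is missing avoids blank lines if some events already end with a line break.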
Created 04-20-2017 04:46 PM
I was thinking that too, and Hellmar's answer gave me a clue as to how to do it. But using ReplaceText to add a newline doesn't let me say "add a new line right after a specific bin of events" or "add a new line right before the first line in this bin of events". It only lets me use regex to find keywords in the data. (I am putting a newline before the timestamp, which works but also adds an extra line after every event.)
Is there a way to specify "put a newline at the end of this bin of events before the append happens"?
Created 04-20-2017 04:53 PM
Using ReplaceText with the Replacement Strategy set to Prepend and the Evaluation Mode set to Entire Text will put the Replacement Value at the beginning of the content. The same can be done with a Replacement Strategy of Append to place the Replacement Value at the end.
Alternatively, if you are using MergeContent (I can't remember whether you are), you can set the Delimiter Strategy to Text and use the Header or Footer property to enter a new line. You can use shift+enter as the property value for the Header or Footer to create a new line.
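As a rough picture of what the Footer option does to the data, here is a sketch in plain Python. It only imitates the effect of merging a bin with a header, demarcator and footer and then appending the result to an existing file; it is not NiFi code, and `merge_bin` and the sample events are hypothetical.

```python
def merge_bin(events, header: bytes = b"", demarcator: bytes = b"", footer: bytes = b"") -> bytes:
    """Imitate a text-delimited merge: header, events joined by demarcator, footer."""
    return header + demarcator.join(events) + footer


bin_one = [b"Apr 19 1:06:58 event data here", b"Apr 19 1:06:59 event data here"]
bin_two = [b"Apr 19 1:07:00 event data here", b"Apr 19 1:07:01 event data here"]

# Each merged bin ends with a newline footer, so the next append starts on a fresh line.
hdfs_file = bytearray()
for events in (bin_one, bin_two):
    hdfs_file += merge_bin(events, demarcator=b"\n", footer=b"\n")

print(hdfs_file.decode())
```

A newline in the Header instead would also separate the bins, but the file would then begin with an empty line, which is one reason the Footer may be the tidier choice.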
Created 04-20-2017 05:26 PM
Awesome! Yes! I think this is what I was looking for.
Created 04-20-2017 05:42 PM
BTW I ended up using the Footer property in MergeContent and it worked wonderfully with no regex involved.