Apache NiFi processor to convert a 'Control A' (\u0001) separated file to Avro

Rising Star

I need to convert a Ctrl-A (\u0001) separated file into Avro. How can I do this? I tried the ConvertCSVToAvro processor and passed an expression to its 'CSV delimiter' property, but the property does not accept a regular expression. Is there any other way to achieve this in Apache NiFi?

7 REPLIES

You should be able to use the ReplaceText processor to change the \u0001 delimiter to whatever you like (new line, comma, etc.) and then use ConvertCSVToAvro with a literal CSV delimiter. I haven't tried it, but does the CSV delimiter need to be a regular expression in order to identify a Unicode character? Is it just a matter of escaping the \ in the literal?
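
As a rough illustration of the replace-then-convert idea outside NiFi, here is a minimal Python sketch. The function name `ctrl_a_to_csv` and the sample data are hypothetical, not part of NiFi. It also shows one caveat of a plain character swap: a field that already contains a comma needs quoting, which the `csv` writer takes care of.

```python
import csv
import io

CTRL_A = "\u0001"  # Control A, a single non-printing character


def ctrl_a_to_csv(text: str, delimiter: str = CTRL_A) -> str:
    """Rewrite delimiter-separated lines as comma-separated CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        # Split on the original delimiter; the writer quotes any field
        # that itself contains a comma.
        writer.writerow(line.split(delimiter))
    return out.getvalue()


# Hypothetical sample: the second field of the data row contains a comma.
sample = "id\u0001name\n1\u0001Smith, J\n"
print(ctrl_a_to_csv(sample), end="")
```

If the data can contain commas, a bare ReplaceText swap to a comma would shift columns, so a delimiter that cannot occur in the data may be the safer choice.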

Master Guru

On a Mac, I found a useful procedure here for enabling the pasting of a Unicode character into the current text box. Using this, I opened the ConvertCSVToAvro processor dialog, then the CSV Delimiter property value dialog. Then using the procedure I selected character \u0001, which pastes it into the property (although it is not a visible character so you won't see it on the screen). Click OK then Apply and the delimiter should be set to \u0001. I tried this with a simple example and it worked.

On Windows I think you can use the Character Map or something similar. The idea is either to have some utility copy a Unicode character to the clipboard so you can paste it into the property value dialog, or to have the utility paste it for you (like the Mac one).

Once NIFI-2369 is resolved, there might be a way to use Expression Language to make this more visible, like ${literal('\u0001')} or something.

Alternatively, you could use a scripting processor like ExecuteScript and do the split in code (JavaScript or Groovy, for example).
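
The scripted split could look roughly like the sketch below. It is plain Python run outside NiFi rather than an actual ExecuteScript body, it assumes the first line of the file is a header row, and `parse_records` is a hypothetical helper name.

```python
def parse_records(text: str, delimiter: str = "\u0001") -> list:
    """Split delimited text into one dict per data row, keyed by the header."""
    lines = [line for line in text.splitlines() if line]
    if not lines:
        return []
    header = lines[0].split(delimiter)
    # Pair each value with its column name from the header row.
    return [dict(zip(header, line.split(delimiter))) for line in lines[1:]]


# Hypothetical sample with a header and two data rows.
sample = "id\u0001name\n1\u0001Ann\n2\u0001Bob\n"
for record in parse_records(sample):
    print(record)
```

Records in this shape are close to what an Avro writer expects, although the schema and the Avro serialization itself are not shown here.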

Rising Star

Thanks @Matt Burgess and @Andy LoPresto. My requirement is this: I am watching a folder where my input files arrive. These files can use different delimiters: \u0001, \u0002, or a comma (CSV). So I was wondering whether we can convert these files into Avro by passing the delimiter as a flowfile attribute.

Is there an efficient way to handle this other than using ReplaceText?

Master Guru

How do you know which delimiter is used for a particular file? If you can determine that from the content, you might be able to use RouteContent to send all \u0001-delimited files to one ConvertCSVToAvro (using the technique I describe above), all \u0002 files to another, and so on. Likewise if you can somehow extract the delimiter into an attribute you can use RouteOnAttribute rather than RouteContent.

Why would you like to avoid ReplaceText? The content of the flow files will be altered when converting to Avro, so you won't have the original input at that point. If it is a performance issue, do you think my suggestion above would work for your use case?
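
If the delimiter can be determined from the content, the check might be as simple as the following sketch. It assumes control characters never occur inside field values, so the first candidate found on the first line wins. `detect_delimiter` and the candidate list are illustrative only, not NiFi features.

```python
# Candidates in priority order: control characters first, comma last,
# because a comma may legitimately appear inside a field value.
CANDIDATES = ("\u0001", "\u0002", ",")


def detect_delimiter(sample: str, candidates=CANDIDATES):
    """Return the first candidate present on the first line, or None."""
    lines = sample.splitlines()
    first_line = lines[0] if lines else ""
    for delimiter in candidates:
        if delimiter in first_line:
            return delimiter
    return None


# repr() makes the otherwise invisible control character visible.
print(repr(detect_delimiter("a\u0001b,c\n")))
print(repr(detect_delimiter("a,b,c\n")))
```

The detected value could then drive the routing decision, with one conversion path per delimiter as described above.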

Rising Star

@Matt Burgess, we decided to use RouteOnAttribute for files that are separated by Control A or are CSV.

We now face a new challenge with the Control A separated files. We get the delimiter as a flowfile attribute, say ${srcdelim} (whose value is \u0001), and use that attribute in the ReplaceText processor. The resulting file is not Control A separated; I just see the literal text \u0001 instead.

Can you please help me replace a Control A character?
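
One possible cause, which the thread does not confirm, is that the attribute holds the six visible characters `\`, `u`, `0`, `0`, `0`, `1` rather than the single Control A character. The Python sketch below only illustrates that distinction and how such an escape sequence could be turned into the real character. `unescape_delimiter` is a hypothetical helper, not a NiFi function.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_delimiter(value: str) -> str:
    """Replace each backslash-u escape in value with the character it names."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)


literal = r"\u0001"                   # six characters of plain text
actual = unescape_delimiter(literal)  # one control character
print(len(literal), len(actual))      # prints: 6 1
```

Values without an escape sequence, such as a plain comma, pass through unchanged.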

Master Guru

What do your search and replace regular expressions look like? You might be matching the whole text looking for the delimiter and replacing it with the delimiter, when really it sounds like you want to match the delimiter and replace it with something else (a comma, for example).

Rising Star

I am trying to append a timestamp or a date to each line, like this:

${srcdelimiter}${now():format('MM-dd-yy')}
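
For comparison, the intended result (each line followed by the delimiter and a month-day-year date) can be sketched in Python. This runs outside NiFi, `append_date_column` is a hypothetical name, and the date is passed in as an argument so the output is reproducible.

```python
from datetime import date


def append_date_column(text: str, delimiter: str, day: date) -> str:
    """Append delimiter plus a month-day-year date to every line of text."""
    stamp = day.strftime("%m-%d-%y")  # %m is the month; %M would be minutes
    return "".join(f"{line}{delimiter}{stamp}\n" for line in text.splitlines())


# Hypothetical two-line Control A separated input.
sample = "a\u0001b\nc\u0001d\n"
result = append_date_column(sample, "\u0001", date(2016, 7, 4))
print(repr(result))
```

Working line by line avoids the pitfall of a pattern that matches the whole text at once, so the extra column is added exactly once per line.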