Support Questions

Find answers, ask questions, and share your expertise

What processor can I use to handle delimiters in a text file

Rising Star

I have text files that sometimes contain various delimiters such as quotation marks, commas, and tabs. What processor can I use to handle such delimiters, and how do I configure it to handle them in my file? The ConvertCsvToAvro "properties" section has properties similar to what I want to achieve. Thanks

1 ACCEPTED SOLUTION

Rising Star
@Bryan Bende your take on my comments above is much appreciated. Thanks


10 REPLIES

Master Guru

What do you want to do with your text files? Do you want to convert them to another format, and if so, what format?

There are many existing processors in NiFi that can manipulate text:

- ReplaceText can modify the content by finding a pattern and performing a replacement

- ExtractText can find patterns and extract them into attributes

- SplitText can split the lines of a text file

- ExecuteScript can apply a Groovy/Jython script to manipulate the text

It really depends what you want to do.

Rising Star

Hi @Bryan Bende, thanks for getting back to me. To answer your questions:

What do you want to do with your text files?

I want to move the flat files containing delimiters from SFTP into a SQL database using NiFi. How NiFi handles the delimiters in the flat file is my concern. Which of the processors listed above handles the delimiters in the text file? I posted a sample data flow of what I want to achieve, and you answered it here: https://community.hortonworks.com/questions/57779/how-to-preventing-duplicates-when-ingesting-into-m...

Do you want to convert it to another format, if so what format?

No, I do not want to convert them to another format.

Master Guru

In your flow from the other post, you are getting a file with FetchSFTP, splitting it into lines with SplitText, using ExtractText to parse out the values, and using ReplaceText to construct a SQL statement.

The ExtractText processor is the one that needs to understand the delimiter in order to get the value of each column. Since your flow was working, you must have already configured ExtractText with a pattern to parse the line, right? So are you just asking how to handle more delimiters?

Rising Star

@Bryan Bende Thanks once again for getting back to me. I will answer your questions in order:

(1) Since your flow was working you must have already configured ExtractText with a pattern to parse the line right?

Yes, I used a regular expression (attached config images below) to parse the line, but I do not think that is the best way to handle this. I would rather parse the lines using the delimiter.

8130-extracttextprocessorconfig1.jpg

8131-extracttextprocessconfig2.jpg

(2) So are you just asking how to handle more delimiters?

Yes, that would be helpful.

Rising Star
@Bryan Bende your input regarding this question is much appreciated. Thanks

Master Guru

There isn't a specific processor made just for parsing delimited lines, mostly because you can already do that with ExtractText. You should only need one pattern to parse the whole line. Let's say I have a simple CSV with 4 columns; you could add one property like this:

csv = (.+),(.+),(.+),(.+)

That will add attributes csv.1, csv.2, csv.3, csv.4 containing each respective column.
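As a rough illustration of what that pattern does, here is the equivalent match in Python's `re` module (ExtractText uses Java regex, but this pattern behaves the same; the sample line is made up):

```python
import re

# The same pattern ExtractText would apply to the line.
# Note that (.+) is greedy; it works here because the line has exactly
# three commas, but a field that itself contains a comma would throw
# the grouping off.
pattern = re.compile(r"(.+),(.+),(.+),(.+)")

line = "alice,42,NY,active"  # made-up sample row
match = pattern.match(line)

# ExtractText would write these groups to attributes csv.1 .. csv.4
attrs = {"csv.%d" % i: match.group(i) for i in range(1, 5)}
print(attrs)
# {'csv.1': 'alice', 'csv.2': '42', 'csv.3': 'NY', 'csv.4': 'active'}
```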

You could have different instances of ExtractText to handle the different types of delimiters; you would need to route the data to each one appropriately.

For a more user-friendly option, you could implement a custom processor like ParseCSV or ParseDelimited that takes the delimiter as a property and then uses some kind of parsing library, or your own code, to parse the line.

A second alternative is to write a Groovy or Jython script to do the parsing and use the ExecuteScript processor.
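Either of those last two options boils down to the same parsing logic. Here is a minimal Python sketch (the function name and sample lines are made up; inside ExecuteScript you would wire this up to the NiFi session/FlowFile API, and Jython is Python 2-flavored, but the csv-module approach is the same):

```python
import csv
import io

def parse_line(line, delimiter=","):
    """Parse one delimited line with the standard csv module, which
    also honors quoted fields (e.g. a comma inside quotation marks)."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter, quotechar='"')
    return next(reader)

print(parse_line('alice,"New York, NY",42'))  # ['alice', 'New York, NY', '42']
print(parse_line("1;2;3;4", delimiter=";"))   # ['1', '2', '3', '4']
```

Using a real CSV parser instead of a regex also solves the quotation-mark case from the original question for free.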

Rising Star

@Bryan Bende let's say I want to go the route where I use a different ExtractText to handle each delimiter; how do I go about that? I am quite confused here (a vivid example would be helpful). From my understanding, the ExtractText processor will parse a file regardless of its delimiters, and what actually matters is the regular expression used to extract the data? Correct me if I am wrong. I also tried replicating your example above; the ingestion of the flow file was successful, but there was no data in the database tables.

Rising Star
@Bryan Bende your take on my comments above is much appreciated. Thanks

Master Guru

I'm not sure if this is what you are looking for, but here is an example flow that generates data like:

A,B,C,D

1;2;3;4

It then splits the content into two lines, routes the first line to an ExtractText configured to parse commas, and routes the second line to a different ExtractText configured to parse semicolons.
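The routing decision in that flow can be sketched like this (a toy illustration of the content-based check a RouteOnContent-style regex would perform; the function name is made up):

```python
import re

# Decide which ExtractText instance a line should go to, based on
# which delimiter it contains (semicolon is checked first).
def route(line):
    if re.search(r";", line):
        return "semicolon"
    if re.search(r",", line):
        return "comma"
    return "unmatched"

print(route("A,B,C,D"))   # comma
print(route("1;2;3;4"))   # semicolon
```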

routeextractdelimitedtext.xml