Using ExtractText to extract attributes for each column in a CSV

Contributor

I've googled everywhere for this, and everything I run across is super complicated. It should be relatively simple to do. The recommendations point to the "Example_With_CSV.xml" template.

So given a flowfile that's a CSV:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

I need

$date = 2017-09-20 23:49:38.637

$id = 162929511757

...

$instanceid = 36095

$comment = "Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

OR

$csv.date = ...

$csv.id = ...

...

$csv.instanceid = ...

$csv.comment = ...

Is there an easier option for this than RegEx? I can't stand doing anything with regular expressions because of how unreadable and overly complicated they are. To me there should be a significantly easier way of doing this.

I looked at the templates at https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates, but nothing in there actually gets the value of each column out.

Re: Using ExtractText to extract attributes for each column in a CSV

New Contributor

Hi Kevin,

Are you using NiFi v1.2 or later? If so, you have access to the Record-Oriented processors, which provide a great way to handle CSV data and should perform significantly better. Mark Payne has published an excellent example of some of these processors and their advantages. There are also other articles on Record-Oriented processors on this community site, and on the personal blogs of other NiFi rockstars like @Bryan Bende.

I'm not sure of your downstream use case, but in my experience storing CSV data as attributes on flow files is most commonly used for flow file routing or for data restructuring and persistence (HDFS, a database, etc.). If that is one of your intended purposes, then once the data is in Record-Oriented form you get quite a lot of flexibility in what you can do with it, including executing SQL queries against your records with a QueryRecord processor and inserting your records directly into a database with a PutDatabaseRecord processor.

There is a little additional setup, as you need to define an Avro schema that represents your CSV file (and select or define a schema registry), but conveniently NiFi has an out-of-the-box InferAvroSchema processor to help you fast-track this (see @Timothy Spann's article).

The Record-Oriented paradigm is relatively new and growing quickly with each NiFi release, so check the release notes for each version to see which processors are available to you. If you're not on NiFi v1.2 yet, I'd suggest upgrading to the latest version if possible. Having gone through the upgrade process a number of times, I've found it relatively straightforward, provided you have configured your NiFi instance according to the suggested best practice.
The Record-Oriented processors are an exciting addition to the NiFi toolkit so I’d suggest investing the time to embrace them as they will make your flows cleaner, more performant, and provide greater flexibility in how you handle your data.
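
To give a rough idea of the Avro schema mentioned above, here is a minimal sketch that builds one for the sample row in the question and prints it as JSON. Only date, id, instanceid and comment come from the question; the record name, the "device" name for the third column, the "unused_N" filler names and the filler count are assumptions made for illustration, and every field is typed as a nullable string to keep the sketch simple.

```python
import json

# Names taken from the question: date, id, instanceid, comment.
# "device" and the "unused_N" fillers are hypothetical placeholders.
N_FILLER = 9  # assumed number of empty columns; adjust to the real file

field_names = (
    ["date", "id", "device", "instanceid"]
    + [f"unused_{i}" for i in range(1, N_FILLER + 1)]
    + ["comment"]
)

schema = {
    "type": "record",
    "name": "CsvRow",  # hypothetical record name
    "fields": [
        # Nullable string with a null default, so empty columns are allowed.
        {"name": name, "type": ["null", "string"], "default": None}
        for name in field_names
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The printed JSON is the kind of text that would be supplied as the schema; real column types (timestamps, longs) would replace the plain strings once the data is better understood.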

Re: Using ExtractText to extract attributes for each column in a CSV

Contributor

I ended up not using NiFi for this. Looking back, I tried forcing a solution out of NiFi that wasn't a good fit. I spent several weeks, entirely too long, trying to solve the most simple case of this project (formatting some text and dumping it to a db).

I could certainly see NiFi being useful for moving source data files around from the folders I’m working with (copying, moving etc.) but doing any amount of logic or manipulation of anything but a happy path is extremely tedious and seemingly difficult to do.

Knowing that I was going to have to do a lot more work on the data to make it even close to usable, I just scrapped NiFi and implemented it in Python.
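
As a rough illustration of the Python route (not the code actually used for the project), the sketch below parses the sample row from the question with the standard csv module, which handles the quoted comment field and its embedded comma without any regular expressions. The column positions (date first, id second, instanceid fourth, comment last) are assumptions read off the single sample row, and the function names are made up for this example.

```python
import csv
import io

# The sample row from the question, kept verbatim.
SAMPLE = (
    '2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,'
    '"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"\n'
)


def row_to_attributes(row):
    """Map one parsed CSV row to the attribute names used in the question.

    Positions are assumptions based on the sample row: date, id and
    instanceid are columns 0, 1 and 3, and the comment is the last column.
    """
    return {
        "csv.date": row[0],
        "csv.id": row[1],
        "csv.instanceid": row[3],
        "csv.comment": row[-1],
    }


def parse_csv(text):
    """Parse CSV text and return one attribute dict per usable row."""
    reader = csv.reader(io.StringIO(text))
    # Skip blank or truncated rows rather than failing on them.
    return [row_to_attributes(row) for row in reader if len(row) >= 5]


attributes = parse_csv(SAMPLE)
for name, value in attributes[0].items():
    print(f"{name} = {value}")
```

Dirty rows (missing columns, stray quotes) would still need their own handling; the length check here only skips rows that are too short to index.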

After dealing with this data and running into edge case after edge case that I wasn't even aware of when I wrote this topic, the data IMO was just too dirty and had too many exceptions to deal with in NiFi. On top of that, this was just the import of the data, not even using it, so I would have needed another tool to actually process the data into a usable form anyway.

Appreciate the response. You took the time to respond so I figured it was reasonable to respond even though I didn’t end up using the solution.
