
Using Nifi to do processing on CSV file before inserting into database

Expert Contributor

I have a CSV file that is messy.

I need to:

1. Get the date from the filename and prepend it to the timestamp column.

2. Parse the CSV into columns, which is tricky because the very last column is a free-text string that itself contains the separator ",".

The data looks like this.

Filename: ExampleFile_2017-09-20.LOG

Content:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),
23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,Command Measure, Targets complete - Elapsed: 76064 ms,

The following is what will need to be inserted into the database:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

2017-09-20 23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,"Command Measure, Targets complete - Elapsed: 76064 ms"

Would I need to do this inside of NiFi or some external script by calling some type of ExecuteScript?
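For reference, both steps can be sketched in a few lines of plain Python (as a standalone check, or as the basis for an ExecuteScript). The filename pattern, the count of 13 leading commas, and the trailing comma are assumptions taken from the example above:

```python
import re

def transform(filename, line):
    # Assumption: the date is embedded in the filename as YYYY-MM-DD.
    date = re.search(r'(\d{4}-\d{2}-\d{2})', filename).group(1)
    # Assumption: exactly 13 commas precede the free-text column and the
    # line ends with a trailing comma. Group 1 keeps the 13 leading
    # fields; group 2 captures the free text up to the last comma.
    m = re.match(r'^((?:[^,]*,){13})(.*),', line)
    return f'{date} {m.group(1)}"{m.group(2)}"'

filename = 'ExampleFile_2017-09-20.LOG'
# ',' * 10 = the comma ending field 4 plus nine empty fields (13 commas total).
line = ('23:49:38.637,162929511757,$009389BF,36095' + ',' * 10 +
        'Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),')
print(transform(filename, line))
```

This is only illustrative; inside NiFi, ExecuteScript would run a scripting engine such as Jython or Groovy rather than CPython.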



Expert Contributor

@Shu

Thank you for the great detailed response.

The first part does work but I don't think the regex will work for my case.

(Side note, no fault of yours: I just absolutely despise regex, as it's unreadable to me and extremely difficult to debug, if it can be debugged at all.)

I should have mentioned this, but the only thing I know about the CSV file is that there are X number of columns before the string.

So I could see something like:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max, attempts,(1=>3), fit failing entrely,(FitFailure=True),

The only thing I know is that there are 13 columns (commas) before the string and the string will always have a trailing "," (It has always been the last column in the row from what I have seen).

The other issue is that I tried using

(.*),

for every column, so I could feed the captures into a database insert, but the regex seems to blow up and stops working with that many columns (the original data has about 150 columns; I truncated it down here).
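Since the only invariants are 13 leading commas and one trailing comma, the quoting can also be done without regex at all, which can be easier to debug than a 150-group pattern. A sketch under those assumptions (the 13-column count comes from the example above):

```python
def quote_last_field(line):
    # Split at the first 13 commas; parts[13] is everything after them.
    parts = line.split(',', 13)
    head, tail = parts[:13], parts[13]
    # Assumption: the free-text column always ends with a trailing comma.
    if tail.endswith(','):
        tail = tail[:-1]
    return ','.join(head) + ',"' + tail + '"'

# ',' * 10 = the comma ending field 4 plus nine empty fields (13 commas total).
messy = ('23:49:38.637,162929511757,$009389BF,36095' + ',' * 10 +
         'Failed to fit max, attempts,(1=>3), fit failing entrely,(FitFailure=True),')
print(quote_last_field(messy))
```

Because it never backtracks, the cost of this approach stays linear in the line length no matter how many commas the free-text field contains.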

Master Guru
@Kevin Vasko

If you know there are 13 columns before the string, then you can use a ReplaceText processor with a regex that matches everything up to the 13th comma, then everything after the 13th comma up to (but not including) the last comma.

ReplaceText processor configs — change the properties below:

Search Value

^((?:[^,]*,){13})(.*), //capture group 1 until 13th comma and capture group 2 after 13th comma until last comma

Replacement Value

$1"$2" //keep 1 group as is and add double quotes for 2 group

Maximum Buffer Size

1 MB //if your flowfile content is more than 1 MB then you need to change this value.

Replacement Strategy

Regex Replace

Evaluation Mode

Line-by-Line

Configs:-

[screenshot: replacetext.png — ReplaceText processor configuration]

With the search value above, everything up to the 13th comma is kept as-is, while the whole content after the 13th comma up to the last comma is matched and enclosed in double quotes (").
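The search/replace pair can be sanity-checked outside NiFi; Python's re.sub uses \1/\2 where NiFi uses $1/$2, but the pattern itself is unchanged:

```python
import re

search = r'^((?:[^,]*,){13})(.*),'   # same Search Value as in the processor
# ',' * 10 = the comma ending field 4 plus nine empty fields (13 commas total).
line = ('23:49:38.638,162929512814,$008EE9F6,-16777208' + ',' * 10 +
        'Command Measure, Targets complete - Elapsed: 76064 ms,')
# \1 keeps the 13 leading fields as-is; \2 is wrapped in double quotes,
# and the trailing comma (matched outside both groups) is dropped.
print(re.sub(search, r'\1"\2"', line))
```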

If you are worried that the regex will blow up with the ReplaceText configuration above, you can place a SplitText processor before ReplaceText to split the big CSV so that each flowfile holds one line. ReplaceText then works on one small flowfile at a time, and afterwards you can use a MergeContent processor if you want to merge the small flowfiles back into one big file.

Expert Contributor

Thanks! That seems to work correctly. I'll mark this as the answer, as it produces the output I'm looking for.