Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Using Nifi to do processing on CSV file before inserting into database

avatar
Expert Contributor

I have a CSV file that is messy.

I need to:

1. Get the date from the filename and use that as my date and append that to one of the columns.

2. Parse the CSV file to get the columns as the very last column is a string which has separators in the string ",".

The data looks like this.

Filename: ExampleFile_2017-09-20.LOG

Content:

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True),
23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,Command Measure, Targets complete - Elapsed: 76064 ms,

The following is what will need to be inserted into the database:

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"

2017-09-20 23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,"Command Measure, Targets complete - Elapsed: 76064 ms"

Would I need to do this inside of NiFi or some external script by calling some type of ExecuteScript?

1 ACCEPTED SOLUTION

avatar
Master Guru

@Kevin Vasko

Yes we can do your case using NiFi Processors without using any external scripts.

Here is what i tried:-

First you need to extract the date from filename and keep it as attribute to the flowfile by using

Update Attribute processor:-

add new property

date

${filename:substringAfter('_'):substringBefore('.')}

Configs:-

47430-update.png

Then use Replace text processor to prepend the existing flowfile csv data with extracted date attribute

Change the

Replacement Value

${date}

**keep space after ${date} in above.

Maximum Buffer Size

1 MB //if your flowfile content is more than 1 MB then you need to change this value.

Replacement Strategy

Prepend

Evaluation Mode

Line-by-Line

Configs:-

47431-replacetext.png

So in this processor we are just prepending the data with date attribute that is extracted in Update attribute processor and we are going to prepend date value Line-by-Line evoluation mode.

Then use another replace text processor to replace last , with double quotes "

Search Value

(.*),(.*),(.*),

Replacement Value

$1,"$2,$3"

Maximum Buffer Size

1 MB //if your flowfile content is more than 1 MB then you need to change this value.

Replacement Strategy

Regex Replace

Evaluation Mode

Line-by-Line

Configs:-

47432-replacetext.png

So in this processor we are extracting all the data between last 2 commas and then replacing first group as is and enclosing group2 and group3 with double quotes and removing , at last.

Output:-

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"
2017-09-20 23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,"Command Measure, Targets complete - Elapsed: 76064 ms" 

I have attached the xml file you can use the same xml file and change the configs as per your needs.

Flow:-

47433-flow.png

Reference xml:-

date-155661.xml

Let me know if you have any issues..!!

If the Answer helped to resolve your issue, Click on Accept button below to accept the answer, That would be great help to Community users to find solution quickly for these kind of errors.

View solution in original post

4 REPLIES 4

avatar
Master Guru

@Kevin Vasko

Yes we can do your case using NiFi Processors without using any external scripts.

Here is what i tried:-

First you need to extract the date from filename and keep it as attribute to the flowfile by using

Update Attribute processor:-

add new property

date

${filename:substringAfter('_'):substringBefore('.')}

Configs:-

47430-update.png

Then use Replace text processor to prepend the existing flowfile csv data with extracted date attribute

Change the

Replacement Value

${date}

**keep space after ${date} in above.

Maximum Buffer Size

1 MB //if your flowfile content is more than 1 MB then you need to change this value.

Replacement Strategy

Prepend

Evaluation Mode

Line-by-Line

Configs:-

47431-replacetext.png

So in this processor we are just prepending the data with date attribute that is extracted in Update attribute processor and we are going to prepend date value Line-by-Line evoluation mode.

Then use another replace text processor to replace last , with double quotes "

Search Value

(.*),(.*),(.*),

Replacement Value

$1,"$2,$3"

Maximum Buffer Size

1 MB //if your flowfile content is more than 1 MB then you need to change this value.

Replacement Strategy

Regex Replace

Evaluation Mode

Line-by-Line

Configs:-

47432-replacetext.png

So in this processor we are extracting all the data between last 2 commas and then replacing first group as is and enclosing group2 and group3 with double quotes and removing , at last.

Output:-

2017-09-20 23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,"Failed to fit max attempts (1=>3), fit failing entirely (Fit Failure=True)"
2017-09-20 23:49:38.638,162929512814,$008EE9F6,-16777208,,,,,,,,,,"Command Measure, Targets complete - Elapsed: 76064 ms" 

I have attached the xml file you can use the same xml file and change the configs as per your needs.

Flow:-

47433-flow.png

Reference xml:-

date-155661.xml

Let me know if you have any issues..!!

If the Answer helped to resolve your issue, Click on Accept button below to accept the answer, That would be great help to Community users to find solution quickly for these kind of errors.

avatar
Expert Contributor

@Shu

Thank you for the great detailed response.

The first part does work but I don't think the regex will work for my case.

(Side bit, no fault of yours, I just absolutely despise regex as its unreadable to me and extremely difficult to debug (if at all).)

I should have mentioned this, but the only thing I know about the CSV file is that there are X number of columns before the string.

So I could see something like..

23:49:38.637,162929511757,$009389BF,36095,,,,,,,,,,Failed to fit max, attempts,(1=>3), fit failing entrely,(FitFailure=True),

The only thing I know is that there are 13 columns (commas) before the string and the string will always have a trailing "," (It has always been the last column in the row from what I have seen).

The other issue is I tried doing

(.*),

for all of the columns so I could then put it into a database query to insert the data but the regex seems to blowup and not function with so many columns (the original data has about 150 columns in it and I just truncated it down here).

avatar
Master Guru
@Kevin Vasko

If you know 13 columns before the string then you can use Replace Text processor that matches until 13th comma then match everything after 13th comma until last comma(not including last comma).

Replace text processor Configs:-

Change the below properties as

Search Value

^((?:[^,]*,){13})(.*), //capture group 1 until 13th comma and capture group 2 after 13th comma until last comma

Replacement Value

$1"$2" //keep 1 group as is and add double quotes for 2 group

Maximum Buffer Size

1 MB //if your flowfile content is more than 1 MB then you need to change this value.

Replacement Strategy

Regex Replace

Evaluation Mode

Line-by-Line

Configs:-

47438-replacetext.png

By using above search value property we are just replacing as is until 13th comma and matching whole content after 13th comma until last comma and enclosing whole content in double quotes(").

If you are thinking that regex will blow up by using above replace text processor configurations then you can use

Split Text processor before Replace Text processor to split big csv file each flowfile having 1 line,Then use

Replace Text Processor will work with 1 small flowfile, After replace text you can use

Merge Content Processor(if you want to merge small flowfiles into 1 big file again).

avatar
Expert Contributor

Thanks! That seems to work correctly. I'll mark this as the answer as it produces the answer I'm looking for.