Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to delete a row/drop a column?

Solved Go to solution

How to delete a row/drop a column?

I am new to Apache nifi and trying to evaluate tool for given business scenarios.

I ingested csv files using ListFile and FetchFile Processor, now want to perform below transformations on below data:

1. To drop a column from this ingested data.(I can specify column name for the action)

2. To delete a row from the ingested data using specific condition.

3. To perform transformations on null values present in the data

Thanks in advance!!

Regards,

Garima.

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: How to delete a row/drop a column?

Guru

The best solution seems to be ... after FetchFile:

  • SplitText processor (into single lines), then
  • ExtractText processor (delete row) to regex match rows you want to discard, and connect to next processor with unmatched
  • ReplaceText processor (drop column) to regex find a column and replace empty value
    • Search Value e.g.: (.*,){2} to match the third column
    • Replacement Value: ''
    • need special attention to delim for first and last columns because there will be only trailing or leading
  • ReplaceText processor (transform nulls) to regex find a column and replace empty value
    • Search Value regex find all nulls
    • Replacement Value: what you want to transfer to
  • From here you can merge lines if you need to batch or keep text split as single lines (records) if you want to stream.

A bit challenging on the regex skills, but completely within the realm of regex operations.

See the HCC article for a full working example: https://community.hortonworks.com/articles/66861/nifi-etl-removing-columns-filtering-rows-changing.h...

View solution in original post

7 REPLIES 7
Highlighted

Re: How to delete a row/drop a column?

Guru

The best solution seems to be ... after FetchFile:

  • SplitText processor (into single lines), then
  • ExtractText processor (delete row) to regex match rows you want to discard, and connect to next processor with unmatched
  • ReplaceText processor (drop column) to regex find a column and replace empty value
    • Search Value e.g.: (.*,){2} to match the third column
    • Replacement Value: ''
    • need special attention to delim for first and last columns because there will be only trailing or leading
  • ReplaceText processor (transform nulls) to regex find a column and replace empty value
    • Search Value regex find all nulls
    • Replacement Value: what you want to transfer to
  • From here you can merge lines if you need to batch or keep text split as single lines (records) if you want to stream.

A bit challenging on the regex skills, but completely within the realm of regex operations.

See the HCC article for a full working example: https://community.hortonworks.com/articles/66861/nifi-etl-removing-columns-filtering-rows-changing.h...

View solution in original post

Highlighted

Re: How to delete a row/drop a column?

Thanks @Greg Keys for your quick and crisp replies for all the queries.

- I am able to delete the row using ExtractText Processor but the issue i am facing is getting a single line file ie. If i am having 10 number of rows initially then the transformed output file is having all the rows merged in a single row and thus 10 rows changed to 1 row, with all the data.

- Unfortunately, i am not able to drop the column using Replace Text, as I am not able to get full column values.

For example, if my column header is "uuid" which i need to drop, i put this into regex as : (uuid) and replace text with " ", it is only changing the header name in spite of blanking the full column.

- For null values replacement, i am using ReplaceText processor, properties: regex - (^\s+$) for checking empty values in the excel and then replacing the same using replaceText - "hi" but this is also not working.

- I can merge any number of files using MergeText, but it is appending all the data one after the other, is it feasible to join the data or merge two files on the basis of one common header. for eg. I have uuid column in one file with 5 different cols and the same column in other file with 6 more different col, can i join here both the files and get 12 cols dataset in output file(having one common comparison column). You can treat this case as an equijoin condition.

Many Thanks in advance!

Regards,

Garima.

Highlighted

Re: How to delete a row/drop a column?

Guru

@Garima Verma See the HCC article posted to the main answer. This will resolve your issues. Regarding merging files, I suggest posing this as a separate question in HCC ... it is distinct from your original question, and you will get more eyes on it :)

Highlighted

Re: How to delete a row/drop a column?

Thanks. I will compare both the flows and will try again today.. :)

I started a new thread for merging of data :

https://community.hortonworks.com/questions/66916/how-to-perform-equijoin-on-files-while-merging-the...

Great help n support !!

-Garima.

Highlighted

Re: How to delete a row/drop a column?

Every flow is perfectly working.. thanks a lot..!

Can we check blank space also and replace it with some text value?

i used Search Value as " " but it didnt work.. :(

Could you please check dis one too once..

Many thanks!

-Garima.

Highlighted

Re: How to delete a row/drop a column?

Expert Contributor

For deleting a column or performing transformations on Null, I would use the UpdateAttribute processor.

If you want to delete a row I'd use the RouteOnAttribute. (I would then route this to HDFS to log that I 'deleted' it.)

I hope this helps.

Highlighted

Re: How to delete a row/drop a column?

9494-updateattributeconfig.png

Thanks @Matt Andruff... I tried to drop column using UpdateAttribute but unfortunately still unsuccessful

I think i am missing some config in the processors. Please find attached for details :

9492-deletecol.png

and my config for unique attribute is as below wherein "uuid" is my column name i want to drop from ingested data.

Please let me know the required modifications here

Thanks in advance!

Regards,

Garima.

Don't have an account?
Coming from Hortonworks? Activate your account here