
How to remove the header when using NiFi SplitText processor

Expert Contributor

Hello,

I have a CSV file whose first line is a header.

When I use the SplitText processor, each of the small split files contains that header as its first line.

Is there an easy way to generate the split files without the header?

Thanks.


7 REPLIES

Master Guru

You could set the Header Line Count to 0, then send the flowfiles to a RouteOnAttribute processor where you can "skip" the first line by routing on the following Expression Language statement:

${fragment.index:gt(0)}

The first line will be routed to "unmatched" and the rest to "matched" or to the user-defined property name (depending on the value of the Routing Strategy property). Note that this requires that the Line Split Count property be set to 1 in SplitText.
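As an illustration, here is a minimal sketch of how those two processors might be configured. The property values follow the post above, and the "skip_header" property name is just an example; also note that, as discussed further down in this thread, SplitText actually starts fragment.index at 1, so the comparison value may need to be 1 instead of 0:

SplitText
    Line Split Count: 1
    Header Line Count: 0

RouteOnAttribute
    Routing Strategy: Route to Property name
    skip_header (user-defined property): ${fragment.index:gt(0)}

Flow files that satisfy the expression go to the "skip_header" relationship; the single flow file holding the header line goes to "unmatched" and can be auto-terminated.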

Alternatively, if you are using (or can upgrade to) NiFi 1.3.0, you can use a record-aware processor with a CSVReader. This reader can be configured to (among other things) skip the header line. The record-aware processors also offer better performance when working with flow files that contain many "records" (such as a CSV file where each "record" is a row).
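For example (just a sketch; the exact property names can vary slightly between NiFi versions), a ConvertRecord processor with a CSVReader could replace the SplitText step entirely:

ConvertRecord
    Record Reader: CSVReader (controller service)
    Record Writer: CSVRecordSetWriter (controller service)

CSVReader
    Treat First Line as Header: true    (the header is used for field names and is not emitted as data)
    Schema Access Strategy: Use String Fields From Header    (or a schema, see the follow-up below)

This keeps the whole CSV in a single flow file rather than one flow file per line, which is where the performance benefit comes from.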

Master Guru

@Alvin Jin To answer your question about which processors to use: it depends on what you want to do with the whole CSV file. Your question only mentions splitting and ignoring the header, and the CSVReader takes care of that. The record-aware processors in NiFi 1.3.0 include:

ConsumeKafkaRecord_0_10: Gets messages from a Kafka topic, bundles into a single flow file instead of one per message

ConvertRecord: Converts records from one data format to another (Avro to JSON, e.g.)

LookupRecord: Uses fields from a record to lookup a value, which can be added back to the record

PartitionRecord: Groups "like" records (based on user-provided criteria) into individual flow files

PublishKafkaRecord_0_10: Posts messages to a Kafka topic

PutDatabaseRecord: Executes a specified operation (INSERT, UPDATE, DELETE, e.g.) on a database for each record in a flow file

PutElasticsearchHttpRecord: Executes a specified operation ("index", e.g.) on an Elasticsearch cluster for each record in a flow file

QueryRecord: Executes SQL queries against the fields of the records. This can be used to filter, aggregate, etc. (see the sketch after this list)

SplitRecord: Splits records into smaller flow files. Usually only used when downstream processors are not record-aware

UpdateRecord: Updates field(s) in each record of a flow file
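As a small illustration of QueryRecord (the column names here are made up; each user-defined property becomes a relationship, and its value is the SQL, which queries the incoming flow file as the table FLOWFILE):

QueryRecord
    Record Reader: CSVReader
    Record Writer: CSVRecordSetWriter
    over_30 (user-defined property): SELECT name, age FROM FLOWFILE WHERE age > 30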

Also I wanted to mention: if for some reason all your CSV columns are strings, you can set "Schema Access Strategy" to "Use String Fields From Header", and then you don't need a schema or a schema registry. Otherwise, if you want to provide a schema, you're not required to use a schema registry; you can just paste your schema into the Schema Text property and set "Schema Access Strategy" to "Use Schema Text Property".
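For instance, the value pasted into the Schema Text property is just an Avro schema; something like the following would cover a mix of string and long columns (the field names here are only placeholders for your actual columns):

{
  "type": "record",
  "name": "my_csv_row",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": "string" },
    { "name": "amount", "type": "long" }
  ]
}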

Expert Contributor

In my case, the CSV columns are not all strings; some are long types.

Yes, I can provide schema text without using Schema Registry.

For your first solution, I think the index starts from 1, so the expression should be ${fragment.index:gt(1)}

Thanks.

Master Guru

SplitText for some reason starts the index at 1, while the other Split processors start at 0. Sorry, I had forgotten that difference. Good catch!

Explorer

Hi @mburgess & @alvinuw

Currently I want to load a txt file (not CSV) into Postgres, and I want to remove the header from the txt file.

I have used these processors: ListFile - FetchFile - SplitText - RouteOnAttribute and ReplaceText (for the regex). I tried your proposal, but it isn't working for me.

Please can you tell me what I am doing wrong?

You will find the screenshot attached.

 

RemoveHeader.PNG

New Contributor

@mburgess I used your 1st suggestion and it worked like a charm, with just one exception: the header row was index 1. I'm not sure if it was just me, my data, or some property/attribute I set wrong. Just thought you should know. So, after modifying the user-defined attribute value to ${fragment.index:gt(1)}, it worked. And, in case you ask, the header row is the first row in the CSV file, which doesn't make sense unless the processor logic changed to 1-based indexing instead of 0-based indexing.

 

Also, thanks for all of your blog posts. I use your suggestions a lot.

Expert Contributor

Hi @Matt Burgess

Thank you for your response.

The first solution works for me.

For the second solution, may I ask which processors I should use, since CSVReader is a controller service, which also requires a schema and a schema registry?

Thanks.