Archives of Support Questions (Read Only)

alvinuw · ‎06-28-2017

Hello,

I have a csv file with the first line as header.

When I use the SplitText processor, each of the small split files contains that header as its first line.

Is there an easy way to generate the split file without header?

Thanks.

mburgess · ‎06-28-2017

You could set the Header Line Count to 0, then send the flowfiles to a RouteOnAttribute processor where you can "skip" the first line by routing on the following Expression Language statement:

${fragment.index:gt(0)}

The first line will be routed to "unmatched" and the rest to "matched" or the user-defined property name (depending on the value of the Routing Strategy property). Note that this requires the Line Split Count property be set to 1 in SplitText.

Alternatively, if you are using (or can upgrade to) NiFi 1.3.0, you can use a record-aware processor with a CSVReader. This reader can be configured to (among other things) skip the header line. The record-aware processors also offer better performance when working with flow files that contain many "records" (such as a CSV file where each "record" is a row).

mburgess · ‎06-30-2017

@Alvin Jin To answer your question about which processors to use: it depends on what you want to do with the whole CSV file. Your question only mentions splitting and ignoring the header, and the CSVReader takes care of that. The record-aware processors in NiFi 1.3.0 include:

ConsumeKafkaRecord_0_10: Gets messages from a Kafka topic, bundles into a single flow file instead of one per message

ConvertRecord: Converts records from one data format to another (Avro to JSON, e.g.)

LookupRecord: Uses fields from a record to look up a value, which can be added back to the record

PartitionRecord: Groups "like" records (based on user-provided criteria) into individual flow files

PublishKafkaRecord_0_10: Posts messages to a Kafka topic

PutDatabaseRecord: Executes a specified operation (INSERT, UPDATE, DELETE, e.g.) on a database for each record in a flow file

PutElasticsearchHttpRecord: Executes a specified operation ("index", e.g.) on an Elasticsearch cluster for each record in a flow file

QueryRecord: Executes SQL queries on fields from the records. This can be used to filter, aggregate, etc.

SplitRecord: Splits records into smaller flow files. Usually only used when downstream processors are not record-aware

UpdateRecord: Updates field(s) in each record of a flow file

Also, I wanted to mention: if for some reason all your CSV columns are strings, you can set "Schema Access Strategy" to "Use String Fields From Header", and then you don't need a schema or schema registry. Otherwise, if you want to provide a schema, you're not required to use a schema registry. You can just paste your schema into the Schema Text property and set "Schema Access Strategy" to "Use Schema Text Property".
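
To illustrate the idea of pasting a schema as text and letting the reader skip the header, here is a minimal Python sketch. It is not NiFi code and does not use any NiFi API; it only mimics, under assumptions, what a schema-driven CSV reader conceptually does. The schema contents (the record name ExampleRow and the fields id and name), the CASTS table, and the read_records helper are all hypothetical and chosen purely for illustration.

```python
import csv
import io
import json

# Hypothetical Avro-style schema, similar in shape to what could be pasted
# into a "Schema Text" property. Field names are illustrative assumptions.
SCHEMA_TEXT = """
{
  "type": "record",
  "name": "ExampleRow",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

# Assumed mapping from a few primitive schema types to Python converters.
CASTS = {"long": int, "int": int, "double": float, "float": float, "string": str}


def read_records(csv_text, schema_text):
    """Parse CSV text into typed records, skipping the header line."""
    fields = json.loads(schema_text)["fields"]
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header line; the schema supplies the names
    records = []
    for row in reader:
        records.append(
            {f["name"]: CASTS[f["type"]](value) for f, value in zip(fields, row)}
        )
    return records


records = read_records("id,name\n1,alpha\n2,beta\n", SCHEMA_TEXT)
```

In this sketch the header never reaches the output, and the id column comes back as an integer rather than a string because the schema declares it as long.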

alvinuw · ‎06-30-2017

In my case, the CSV columns are not all strings; some are long types.

Yes, I can provide schema text without using Schema Registry.

For your first solution, I think the index starts from 1, so the expression should be ${fragment.index:gt(1)}

Thanks.

mburgess · ‎06-30-2017

For some reason SplitText starts the index at 1, while the other Split processors start at 0. Sorry, I had forgotten that difference. Good catch!
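
The effect of the 1-based index can be sketched in plain Python. This is only a simulation and not NiFi code: split_text and route_on_attribute are hypothetical helpers that imitate SplitText (with Header Line Count set to 0 and Line Split Count set to 1) and a RouteOnAttribute rule equivalent to ${fragment.index:gt(1)}.

```python
def split_text(content, line_split_count=1):
    """Imitate SplitText: one fragment per chunk of lines, index starting at 1."""
    lines = content.splitlines()
    fragments = []
    for i in range(0, len(lines), line_split_count):
        fragments.append({
            "fragment.index": len(fragments) + 1,  # 1-based, as noted above
            "content": "\n".join(lines[i:i + line_split_count]),
        })
    return fragments


def route_on_attribute(fragments):
    """Imitate routing on fragment.index > 1: header goes to 'unmatched'."""
    matched = [f for f in fragments if f["fragment.index"] > 1]
    unmatched = [f for f in fragments if not f["fragment.index"] > 1]
    return matched, unmatched


csv_text = "id,name\n1,alpha\n2,beta\n"
matched, unmatched = route_on_attribute(split_text(csv_text))
```

With a split count of 1, the header is the only fragment with index 1, so comparing against 1 rather than 0 is what sends it to "unmatched". With a larger split count the first fragment would also contain data rows, which is why the approach depends on Line Split Count being 1.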

Lamtoro · ‎12-22-2020

Hi @mburgess & @alvinuw

Currently I want to load a txt file (not CSV) into Postgres, and I want to remove the header from the txt file.

I have used these processors: ListFile, FetchFile, SplitText, RouteOnAttribute, and ReplaceText (for the regex). I tried what you proposed, but it did not work for me.

Could you please tell me what I am doing wrong?

The screenshot is attached.

alencosoft · ‎09-29-2021

@mburgess I used your first suggestion and it worked like a charm, with just one exception: the header row was index 1. I'm not sure if it was just me, my data, or some property or attribute I set wrong. Just thought you should know. After changing the user-defined property value to ${fragment.index:gt(1)}, it worked. And, in case you ask, the header row is the first row in the CSV file, so this only makes sense if the processor uses 1-based rather than 0-based indexing.

Also, thanks for all of your blog posts. I use your suggestions a lot.

alvinuw · ‎06-30-2017

Hi @Matt Burgess

Thank you for your response.

The first solution works for me.

For the second solution, may I ask which processors I should use? CSVReader is a controller service, and it also seems to require a schema and a schema registry.

Thanks.

Cloudera Community

How to remove the header when using NiFi SplitText processor