Created 06-28-2017 08:18 PM
Hello,
I have a CSV file whose first line is a header.
When I use the SplitText processor, each of the small split files contains that header as its first line.
Is there an easy way to generate the split files without the header?
Thanks.
Created 06-28-2017 08:32 PM
You could set the Header Line Count to 0, then send the flowfiles to a RouteOnAttribute processor where you can "skip" the first line by routing on the following Expression Language statement:
${fragment.index:gt(0)}
The first line will be routed to "unmatched" and the rest to "matched" or the user-defined property name (depending on the value of the Routing Strategy property). Note that this requires the Line Split Count property be set to 1 in SplitText.
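As a rough sketch, the configuration for this approach would look something like the following (the property name "skipHeader" on RouteOnAttribute is just an example; any name will do):
SplitText: Header Line Count = 0, Line Split Count = 1
RouteOnAttribute: Routing Strategy = Route to Property name, with a user-defined property such as skipHeader = ${fragment.index:gt(0)}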
Alternatively, if you are using (or can upgrade to) NiFi 1.3.0, you can use a record-aware processor with a CSVReader. This reader can be configured to (among other things) skip the header line. The record-aware processors also offer better performance when working with flow files that contain many "records" (such as a CSV file where each "record" is a row).
Created 06-30-2017 05:11 PM
@Alvin Jin To answer your question about which processors to use: it depends on what you want to do with the whole CSV file. Your question only mentions splitting and ignoring the header, and the CSVReader takes care of that. The record-aware processors in NiFi 1.3.0 include:
ConsumeKafkaRecord_0_10: Gets messages from a Kafka topic, bundles into a single flow file instead of one per message
ConvertRecord: Converts records from one data format to another (Avro to JSON, e.g.)
LookupRecord: Uses fields from a record to look up a value, which can be added back to the record
PartitionRecord: Groups "like" records (based on user-provided criteria) into individual flow files
PublishKafkaRecord_0_10: Posts messages to a Kafka topic
PutDatabaseRecord: Executes a specified operation (INSERT, UPDATE, DELETE, e.g.) on a database for each record in a flow file
PutElasticsearchHttpRecord: Executes a specified operation ("index", e.g.) on an Elasticsearch cluster for each record in a flow file
QueryRecord: Executes SQL queries on fields from the records; this can be used to filter, aggregate, etc. (see the query sketch after this list)
SplitRecord: Splits records into smaller flow files. Usually only used when downstream processors are not record-aware
UpdateRecord: Updates field(s) in each record of a flow file
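For QueryRecord in particular, each user-defined property you add holds a SQL statement that is run against the flow file's records, which are exposed as a table named FLOWFILE. A hypothetical example (the column names here are made up) that keeps only rows whose long-typed column exceeds a threshold:
SELECT id, name FROM FLOWFILE WHERE id > 100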
Also, I wanted to mention: if for some reason all your CSV columns are strings, you can set "Schema Access Strategy" to "Use String Fields From Header", and then you don't need a schema or a schema registry. Otherwise, if you want to provide a schema, you're not required to use a schema registry; you can just paste your schema into the Schema Text property and set "Schema Access Strategy" to "Use Schema Text Property".
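For example, the Schema Text property accepts an Avro schema, so a minimal schema for a file with one long column and one string column (the field names here are just placeholders) could look like:
{
  "type": "record",
  "name": "csvRecord",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": "string" }
  ]
}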
Created 06-30-2017 06:14 PM
In my case, the CSV columns are not all strings; there are long types.
Yes, I can provide schema text without using Schema Registry.
For your first solution, I think the index starts from 1, so the expression should be ${fragment.index:gt(1)}.
Thanks.
Created 06-30-2017 08:04 PM
SplitText for some reason starts the index at 1, while the other Split processors start at 0. Sorry, I had forgotten that difference; good catch!
Created 12-22-2020 05:46 PM
Currently I want to load a TXT file (not CSV) into Postgres, and I want to remove the header from the TXT file.
I have used these processors: ListFile - FetchFile - SplitText - RouteOnAttribute and ReplaceText (for the regex). I tried what you proposed, but it is not working for me.
Please can you tell me what I am doing wrong?
You will find the screenshot attached.
Created 09-29-2021 09:06 AM
@mburgess I used your 1st suggestion and it worked like a charm, with just one exception: the header row was index 1. I'm not sure if it was just me, my data, or some property/attribute I set wrong; just thought you should know. So, after modifying the user-defined attribute value to ${fragment.index:gt(1)}, it worked. And, in case you ask, the header row is the first row in the CSV file, which doesn't make sense unless the processor logic changed to 1-based indexing instead of 0-based indexing.
Also, thanks for all of your blog posts. I use your suggestions a lot.
Created 06-30-2017 02:00 PM
Thank you for your response.
The first solution works for me.
For the second solution, may I ask which processors I should use, since CSVReader is a controller service and also requires a schema and a schema registry?
Thanks.