How to remove the header when using NiFi SplitText processor

Contributor

Hello,

I have a CSV file whose first line is a header.

When I use the SplitText processor, each of the small split files contains that header as its first line.

Is there an easy way to generate the split files without the header?

Thanks.

Accepted Solution

Re: How to remove the header when using NiFi SplitText processor

You could set the Header Line Count to 0, then send the flowfiles to a RouteOnAttribute processor where you can "skip" the first line by routing on the following Expression Language statement:

${fragment.index:gt(0)}

The first line will be routed to "unmatched" and the rest to "matched" or the user-defined property name (depending on the value of the Routing Strategy property). Note that this requires the Line Split Count property be set to 1 in SplitText.
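The routing idea above can be sketched outside NiFi. The following is plain Python, not NiFi code: `split_text` and `route_on_attribute` are hypothetical helpers that only mimic the behavior described here. It assumes SplitText numbers fragments starting at 1 (as a later reply in this thread points out), so the equivalent expression would be ${fragment.index:gt(1)}.

```python
def split_text(content, line_split_count=1):
    """Mimic SplitText with Header Line Count = 0.

    Returns one fragment per `line_split_count` lines. The 1-based
    fragment.index is an assumption taken from a later reply in the thread.
    """
    lines = content.splitlines()
    fragments = []
    for start in range(0, len(lines), line_split_count):
        chunk = lines[start:start + line_split_count]
        fragments.append({
            "fragment.index": len(fragments) + 1,
            "content": "\n".join(chunk),
        })
    return fragments


def route_on_attribute(fragments, first_index=1):
    """Mimic routing on fragment.index > first_index.

    Fragments past the first go to "matched"; the first one (the header
    line when Line Split Count is 1) goes to "unmatched".
    """
    matched = [f for f in fragments if f["fragment.index"] > first_index]
    unmatched = [f for f in fragments if f["fragment.index"] <= first_index]
    return matched, unmatched


csv_text = "id,name\n1,alpha\n2,beta"
matched, unmatched = route_on_attribute(split_text(csv_text))
print([f["content"] for f in matched])
print([f["content"] for f in unmatched])
```

With a Line Split Count greater than 1, the first fragment would hold the header plus data lines, which is why the approach depends on one line per fragment.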

Alternatively, if you are using (or can upgrade to) NiFi 1.3.0, you can use a record-aware processor with a CSVReader. This reader can be configured to (among other things) skip the header line. The record-aware processors also offer better performance when working with flow files that contain many "records" (such as a CSV file where each "record" is a row).
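As a rough analogy only, here is how a record-oriented reader treats the header: it supplies the field names and never appears as a data record. This uses Python's standard csv module, not NiFi's CSVReader, and the sample data and the `read_records` helper are made up for illustration.

```python
import csv
import io


def read_records(csv_text):
    """Yield one dict per data row; the header row only supplies the keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield dict(row)


sample = "id,name\n1,alpha\n2,beta\n"
records = list(read_records(sample))
print(records)
```

Because every row arrives as a record, there is no need to split the file line by line just to drop the header.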

Re: How to remove the header when using NiFi SplitText processor

@Alvin Jin To answer your question about which processors to use: it depends on what you want to do with the whole CSV file. Your question only mentions splitting and ignoring the header, and the CSVReader takes care of that. The record-aware processors in NiFi 1.3.0 include:

ConsumeKafkaRecord_0_10: Gets messages from a Kafka topic and bundles them into a single flow file instead of one per message

ConvertRecord: Converts records from one data format to another (e.g., Avro to JSON)

LookupRecord: Uses fields from a record to look up a value, which can be added back to the record

PartitionRecord: Groups "like" records (based on user-provided criteria) into individual flow files

PublishKafkaRecord_0_10: Posts messages to a Kafka topic

PutDatabaseRecord: Executes a specified operation (e.g., INSERT, UPDATE, DELETE) on a database for each record in a flow file

PutElasticsearchHttpRecord: Executes a specified operation (e.g., "index") on an Elasticsearch cluster for each record in a flow file

QueryRecord: Executes SQL queries on fields from the records. This can be used to filter, aggregate, etc.

SplitRecord: Splits records into smaller flow files. Usually only used when downstream processors are not record-aware

UpdateRecord: Updates field(s) in each record of a flow file

Also, I wanted to mention that if all your CSV columns are strings, you can set "Schema Access Strategy" to "Use String Fields From Header", and then you don't need a schema or a schema registry. Otherwise, if you want to provide a schema, you're not required to use a schema registry: you can paste your schema into the Schema Text property and set "Schema Access Strategy" to "Use Schema Text Property".
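For the Schema Text route, the text to paste is an Avro-format record schema. Below is a minimal sketch that generates such a schema as JSON. The record name and the column names (id, name, amount) are hypothetical placeholders, not taken from the original file, and `build_schema_text` is just a convenience helper, not part of NiFi.

```python
import json


def build_schema_text(record_name, fields):
    """Build an Avro record schema as a JSON string.

    `fields` is a list of (column_name, avro_type) pairs, listed in the
    same order as the CSV columns.
    """
    schema = {
        "type": "record",
        "name": record_name,
        "fields": [{"name": name, "type": avro_type} for name, avro_type in fields],
    }
    return json.dumps(schema, indent=2)


# Hypothetical columns: two long fields and one string field.
schema_text = build_schema_text(
    "csv_row",
    [("id", "long"), ("name", "string"), ("amount", "long")],
)
print(schema_text)
```

The printed JSON is what would go into the Schema Text property; adjust the field names and types to match the real CSV header.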

Re: How to remove the header when using NiFi SplitText processor

Contributor

In my case, the CSV columns are not all strings; some are long types.

Yes, I can provide the schema text without using a Schema Registry.

For your first solution, I think the index starts from 1, so the expression should be ${fragment.index:gt(1)}

Thanks.

Re: How to remove the header when using NiFi SplitText processor

SplitText for some reason starts the index at 1, while the other Split processors start at 0. Sorry, I had forgotten that difference. Good catch!

Re: How to remove the header when using NiFi SplitText processor

Contributor

Hi @Matt Burgess

Thank you for your response.

The first solution works for me.

For the second solution, may I ask which processors I should use? CSVReader is a service, and it also seems to require a schema and a schema registry.

Thanks.
