About bbende

bbende · ‎06-27-2017

Yes, the message key is the same thing as the record key.

OK, so for your use case you need more than just the same partition for all messages from a flow file: you want the same partition for all flow files from a given source. In that case, we should have a property in the processor for the partition that supports expression language, so that you could do something like:

GetFile (source1) -> UpdateAttribute (set kafka.partition = 1) -> PublishKafka (partition = ${kafka.partition})
GetFile (source2) -> UpdateAttribute (set kafka.partition = 2) -> PublishKafka (partition = ${kafka.partition})

I'll clarify this on the JIRA.

In the meantime, you probably have a better chance of using PublishKafka_0_10 (the non-record version). If you strip off the header before reaching this processor, set the Message Demarcator for PublishKafka_0_10 to a new-line, and set the key to ${filename}, you should get what you are looking for.
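To illustrate why the ${filename} key workaround keeps a file's lines together, here is a rough Python sketch, not NiFi or Kafka code. It splits content on a new-line demarcator and derives the partition from the key, so every message from one file lands on one partition. The function names are hypothetical, and crc32 is only a stand-in for the hash used by Kafka's default partitioner.

```python
import zlib


def split_on_demarcator(content: str, demarcator: str = "\n") -> list[str]:
    """Split flow file content into individual messages, dropping empty ones."""
    return [part for part in content.split(demarcator) if part]


def pick_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically.

    Assumption: crc32 is a stand-in hash for illustration only; it is not
    the hash that Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(filename: str, content: str, num_partitions: int) -> list[tuple[int, str, str]]:
    """Return (partition, key, message) tuples, using the filename as the key."""
    partition = pick_partition(filename, num_partitions)
    return [(partition, filename, msg) for msg in split_on_demarcator(content)]


# Header already stripped; three CSV lines from one hypothetical file.
records = publish("source1.csv", "1,alice\n2,bob\n3,carol\n", num_partitions=4)
partitions = {partition for partition, _, _ in records}
print(len(records), "messages on", len(partitions), "partition(s)")
```

Because the partition depends only on the key, two different files may still hash to the same partition; the sketch only guarantees that a single file is never spread across several.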

bbende · ‎06-27-2017

I think what we need is a way to control the partitioning independently of the message key. The message key is used on the broker side during compaction/age-off, where the latest record with a given message key is retained. If every line of your CSV had the same key, they would all be treated as different versions of the same message, and at some point all of the records except the latest one could be aged off.

I think what you would really want is to use the "id" field from your CSV as the message key, and then indicate to NiFi that all of the messages from this flow file should be sent to the same partition. Unfortunately, that option doesn't currently exist. I created this JIRA to add it: https://issues.apache.org/jira/browse/NIFI-4133
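The effect of the key choice on compaction can be shown with a small Python simulation. This is a simplified model that assumes compaction keeps only the latest value per key; it is not broker code, and the file name and rows are made up for illustration.

```python
def compact(log: list[tuple[str, str]]) -> dict[str, str]:
    """Keep only the latest value seen for each key (simplified compaction model)."""
    latest: dict[str, str] = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones with the same key
    return latest


# Hypothetical CSV rows: (id, name)
rows = [("1", "alice"), ("2", "bob"), ("3", "carol")]

# Every line keyed by the file name: all lines look like versions of one message.
keyed_by_filename = [("data.csv", f"{row_id},{name}") for row_id, name in rows]

# Every line keyed by its own "id" field: each line is a distinct message.
keyed_by_id = [(row_id, f"{row_id},{name}") for row_id, name in rows]

print(compact(keyed_by_filename))  # only the last line survives
print(compact(keyed_by_id))        # all three lines survive
```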

bbende · ‎06-23-2017

@Alvin Jin Also, I know you already implemented a custom service, but there is also some work here by one of the Apache NiFi committers: https://github.com/apache/nifi/pull/1938

bbende · ‎06-23-2017

Site-To-Site does not do anything to the contents of your flow files; if you have 3 flow files, then it transfers 3 flow files. That statement is saying that site-to-site is optimized for a continuous flow of large amounts of data. If you run a test with 3 flow files, it will probably send all 3 to only one of the nodes in your cluster, because that wasn't enough data to reach the point where it would start sending to the other nodes.

bbende · ‎06-23-2017

1) It doesn't really matter; it just needs to be a shared location that all nodes can access.
2) You can do this, and it might work well for small files and small numbers of files, but typically the whole point is to perform the "fetch" in parallel. Here, all the files have to be fetched on the primary node (GetFile), and then all of their contents have to be redistributed to the cluster, instead of just the listings.
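The list-then-fetch idea can be sketched in Python as follows. This is an illustration only: the node names and file names are hypothetical, and round-robin assignment is an assumption made for the example, not a description of how NiFi actually distributes listings. The point is that only the small listing entries move between nodes, and each node then fetches its own share of the content.

```python
from itertools import cycle


def distribute(listing: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Assign listing entries (file names only, no content) to nodes round-robin."""
    assignments: dict[str, list[str]] = {node: [] for node in nodes}
    for name, node in zip(listing, cycle(nodes)):
        assignments[node].append(name)
    return assignments


# The listing is produced once, on the primary node.
listing = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"]
nodes = ["node1", "node2", "node3"]

# Each node would then fetch only the files assigned to it, in parallel.
plan = distribute(listing, nodes)
for node, files in plan.items():
    print(node, "fetches", files)
```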

bbende · ‎06-22-2017

Generally, you only want to schedule a processor on Primary Node only when it is a source processor like ListHDFS, where the listing should be performed just one time.

bbende · ‎06-22-2017

No. You have a 3-node cluster; let's say node #1 is the primary node. MiNiFi is sending data to all nodes, so the data is already divided across all the nodes, but you are only scheduled to process it on node #1. The data on nodes #2 and #3 will just sit there and never get processed.

bbende · ‎06-22-2017

Your SplitText processor is scheduled to run on Primary Node only, which doesn't seem right, since MiNiFi would send data to all nodes. Most likely the flow files that are sitting there are not on the primary node. You can check this by doing a List Queue on that connection and looking at the host column on the right.

bbende · ‎06-22-2017

Your custom NAR needs to have a NAR dependency in the pom.xml on the standard services API:

    <dependency>
        <groupId>org.apache.nifi</groupId>
        <artifactId>nifi-standard-services-api-nar</artifactId>
        <type>nar</type>
    </dependency>

If you can share your custom NAR code or pom files, I can take a look.

bbende · ‎06-21-2017

Since binary concatenation is just writing chunks of raw bytes one after another, there is no real format that can be understood to undo it. There would be no way for another processor to read those bytes and know where the original pieces were joined. If you use a demarcator when merging, then you can use that to unmerge by using SplitContent or SplitText.
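A few lines of Python show the difference. This is a sketch of the general idea rather than of the NiFi processors themselves, and the chunk contents and the new-line demarcator are arbitrary choices for the example.

```python
chunks = [b"alpha", b"beta", b"gamma"]

# Plain binary concatenation: the boundaries between chunks are lost.
merged_raw = b"".join(chunks)
print(merged_raw)

# Merging with a demarcator keeps the boundaries recoverable.
demarcator = b"\n"
merged = demarcator.join(chunks)

# Splitting on the same demarcator restores the original chunks.
recovered = merged.split(demarcator)
print(recovered)
```

This only works if the demarcator never occurs inside the content itself; otherwise the split produces extra pieces.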

Last Visited	‎09-10-2020 01:23 PM

Member Since	‎09-29-2015 04:02 PM
Posts	871
Kudos received	709

Cloudera Community

Re: Using nifi registry in a nifi cluster.

Re: Is there a way to enable a stateful status upd...

Re: Automated Start/Stop of a NiFi Processor

Re: PublishKafkaRecord_0_10 1.2.0.3.0.1.1-5 Error:...

Re: how to configure mergecontent processor

Re: How does PublishKafkaRecord_0_10 use filename ...

Re: How to inject Custom Confluent Schema Registry...

Re: How to distribute files on NiFi cluster and pr...

Re: Data is becoming stuck after Input Port in Nif...

Re: How to unpack/de-merge a file in NiFi that was...