About MattWho

MattWho · ‎01-18-2018

@Alvin Jin NiFi would have never started if a port was not specified. Let me be clear that you are setting up yoru NiFi to either run over a secure or non-secure port and not both. for a non-secure NiFi set these: nifi.web.http.host= nifi.web.http.port= and for a secured NiFi, set these: nifi.web.https.host= nifi.web.https.port= I am not familiar with Kubernete containers. Is the hostname of the container the same as the public facing hostname? Thank you, Matt

MattWho · ‎01-18-2018

@dhieru singh Yo may also consider a two phase approach to mergeContent processor. By using two MergeContent processors in series, you will reduce your NiFi's heap usage and the number of FLowfiles needing to be merged in each iteration. For example, you may have first mergeContent merge based on number of FlowFiles (min 10000 and max 15000). Then have the second merge on size (assuming 64 MB to 128 MB for your HDFS, you would set min size to 64 MB and max to 128 MB). Thanks, Matt

MattWho · ‎01-18-2018

@dhieru singh When you say "all processors" are being overwhelmed, are you saying connection between every single processor is filling and triggering back pressure in your dataflow? Have you looked that the resource of your hardware running your NiFi instance? Is CPU becoming, memory, and/or disk I/O becoming saturated during these spikes? If so, there is not much with in the configuration of NiFi that can help much here. In a case like this it would require that you expand your NiFi into a cluster. You then have two options for your ListenTCP feed. 1. Run the ListenTCP processor on all nodes and place an external load-balancer to distribute the TCP traffic to every node. 2. Have the ListenTCP processor receive data on only one node, but immediately feed the success relationship form that ListenTCP processor to a Remote Process Group (RPG) that can be used to redistribute the received FlowFiles to all nodes in your cluster to spread out the work being done by the rest of the processors in your dataflow(s) If your resources are not saturated, make sure you have allocated enough "Max Timer Driven Threads" to your NiFi instance so that all processors are fully utilizing those server CPU resources. Defaults for NiFi are only 10. The Max Timer Driven Thread count can be adjusted in the "Controller settings" UI found within the hamburger menu in the upper right corner. Note: do not adjust defaults for Event Driven Thread Count. This just increase a thread pool that i not used by default. If disk I/O is high, following best practices to make sure the NiFi logs, Provenance repository(s), Content repository(s), and FlowFile repository are all located on their own physical disks would help here. Thank you, Matt

MattWho · ‎01-18-2018

@Alvin Jin With NiFi 1.5, NiFi has become more restrictive with requard to the allowed headers coming from a client. The hostname in the request header is checked against the configured hostname in the nifi.properties file: nifi.web.http(s).host= If they do not match, you will encounter the error you are seeing. So you will need to access the NiFi UI in 1.5 using the same hostname as specified in that property. Thanks, Matt

MattWho · ‎01-18-2018

@Andrew thomas Once you have a NiFi cluster installed and running, any change you make within the NiFi Ui affects every node in the cluster. There is no way in a NiFi cluster to deploy a flow to just one node only. There is no need to stop any existing running flow within the UI when you are adding/building new flows within the same UI. Any configuration changes that affect any one of NiFi's configuration files (except logback.xml) will require a NiFi restart before those changes will take affect. Thank you, Matt

MattWho · ‎01-17-2018

@Andrew thomas When a NiFi node starts, it unpacks all its nar to the work directory. Before joining an existing cluster, it checks that the following three files locally match exactly what is being used in the cluster: flow.xml.gz <-- contains all configurations done via NiFi UI users.xml <-- Will only exist if NiFi is secured and the default NiFi file based authorizer is being used. authorizations.xml <-- Will only exist if NiFi is secured and the default NiFi file based authorizer is being used. At no time does NiFi compare to make sure that all nodes are running the same set of nars. Mismatched nars only becomes an issue when that component is being used in the flow.xml.gz. In your scenario where you are try to deploy a new custom nar which is not yet being used on your canvas you can update each node one at a time. I recommend creating a new lib directory on each of your NiFi nodes that will hold all your custom nars separately from the default NiFi lib directory. Then modify the nifi.properties file on each of your nodes (this can be done while NiFi is still running since this file is only read on startup). Add the following line to your nifi.properties files: nifi.nar.library.directory.custom=/<path to my custom nifi nars directory> Make sure this directory and your custom nars placed in this directory have proper ownership and permissions. In order to minimize impact on your operational cluster, restart one node at a time. Once a restarted node has joined the cluster, move on to the next. Once all nodes have been restarted and cluster has all nodes reconnected, you can start using your custom processor components on the canvas. Thank you, Matt If you found the answer addressed your question, please take a moment to click on "accept" found below the answer.

MattWho · ‎01-16-2018

@Chris Lundeberg Are you entering "${schema.fingerprint}" or "schema.fingerprint" in the "Correlation Attribute Name" property field of the MergeContent processor? If you are looking to bin files where the value assigned to the "schema.fingerprint" attribute matches, you will want to enter only "schema.fingerprint" in that property. If you want your correlation attribute name to be more unique, you can use the updateAttribute processor before the MergeContent to create something more unique based on both ${schema.fingerprint} and ${tablename}. for example: followed by MergeContent configured similar to below: You understanding of my original explanation was correct. That could also explain missing attributes since all Flwofile without the attribute would end up in same bin and only matching attributes would be retained on merged FlowFiles. As far as one or many duplicate flows. It makes sense to have only a single flow here, but having many identical flows is also ok. Only thing to consider is NiFi processors obtain threads from a thread pull configured in NiFi. The more processors you have, the more processors there are requesting to use a thread from that thread pool. Plus more dataflows just means more processors to manage. The "Max Timer Driven thread count" is set within "Controller settings" found within the hamburger menu icon in the upper right corner of the NiFi UI. You will also find a setting for "Max Event Driven Thread count there",but do not change the value there. There is nothing you will add to the canvas that will use Event Driven Threads unless you specifically configure the processor to use them. This is a deprecated feature that only still exist to avoid breaking backwards compatibility with older flows. Timer Driven thread are much more efficient and will out perform Event Driven threads anyway. The default for Timer Driven threads is only 10. a good staring place here is 2 - 4 times the number of cores you have on a single NiFi host. Assume a 4 node NiFi cluster 4 hosts X (32 cores per node). You would set the Max Timer driven Thread count setting to 64 - 128. Assuming you set it to 32, this would mean there are 32 threads available per host for a cluster total of 128. Monitor top on your systems as you run your dataflow and adjusts as you see fit from there. Thanks, Matt Please take a moment to click "Accept" if you feel I have addressed your questions.

MattWho · ‎01-16-2018

@Chris Lundeberg Maybe helpful to share your MergeContent processors configuration here. 1. How many bins is the processor configured to use? 2. Sounds like each incoming FlowFile may have a considerable Attribute map size. All the attributes of the FlowFiles being merged are held in heap memory until the merge is complete, You may be having heap issues. Seen any Out of Memory errors in the nifi app .log? 3. What is the correlation attribute you are using to bin like FlowFiles? 4. How large is each FlowFile being merged? If they are very small (meaning it would take more then 20,000 of them to reach a 64 MB merged file), you may want to use multiple mergeContent processors in series to reduce the heap usage. Useful links: https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.html https://community.hortonworks.com/questions/87178/merge-fileflow-files-based-on-time-rather-than-siz.html I have no personally seen FlowFiles routed to Failure losing their attributes. That seems very odd to me. The merged FlowFile, depending on configuration, may have different attributes however. I am assuming that your "avro_schema" attribute nay be fairly large. It may be better to use something smaller for your correlation attribute value in the MergeContent processor. You could use the ExtractAvroMetadata processor before the MergeContent processor. It will give you a "schema.fingerprint" attribute you could use instead to accomplish the same. Are you putting "${avro_schema}_${tablename}" in the mergeContent processor's Correlation Attribute Name property value? What this property does is resolve the provide EL above to its actual values then checks the incoming Flowfiles for an attribute with that resolved value. If found it places FlowFiles where the value of that resolved attribute match in the same bin. Just want to make sure you are using this property correctly. All FlowFiles that do not have the FlowFile attribute are allocated to a single bin. You also need to make sure your mergeContent processor is configured to have enough bins (number of needed bins +1) to accommodate all the various possible unique correlation attribute values. If you do not have enough bins, the mergeContent will force the merging of the oldest bin to free a bin to continue allocating additional FlowFiles. Thank you, Matt

MattWho · ‎01-16-2018

@Eric Lloyd With the above configuration, it would only take 1 FlowFile to be assigned to a bin before that bin was marked eligible for merging. There is nothing there that force the processor to wait for other FlowFiles to be allocated to a bin before merge, Both minimums are set to 1 FlowFile and 0 Bytes. In order to actually get 100,000 Flowfiles (this is high and may trigger OOM), there would need to be 100,000 Flowfiles all with the same correlation attribute value in the incoming connection queue at the time the processor runs. This is almost certainly not going to be the case. The Max bin age simply sets an exist strategy here. It will merge a bin regardless if minimums have been met if the bin age has reached this value. You may want to set more reasonable values for your mins and also consider using multiple mergeContent processors in series to step up to the final merged number you are looking for. Thanks, Matt

MattWho · ‎01-16-2018

@Roger Young The Remote Process Group (RPG) is not designed for dynamic target URL assignment. It is designed to communicate with a target standalone of NiFi cluster. During that communication it learns about all currently connected nodes in a target Nifi cluster and retains the URLS for all those nodes so it can perform a load-balanced delivery of data. It the event RPG cannot get an updated listing form the target it will continue to try to delivery to the last known set of target nodes. Since the RPG was never intended to be used to delivery data to multiple independent target NiFi instances, the ideal of dynamic URL was never considered. There are other NifI processors such as putHTTP and InvokeHTTP that can take NiFi Expression Language (EL) as input for the target URL. Thank you, Matt

Online	Online
Last Visited	‎02-01-2026 02:49 AM

Member Since	‎07-30-2019 10:41 AM
Last Visited	‎02-01-2026 02:49 AM
Posts	3,427
Kudos received	1628

Cloudera Community

Re: Setting TTL per key when writing to redis

Re: Best Practice for configuring registry flows

Re: Nifi 2.7.2 Start Problem

Re: Error importing NiFi workflow template from ve...

Re: Error importing NiFi workflow template from ve...

Re: NiFi 1.5 System Error: Invalid host header

Re: nifi data flow handling data spikes in listent...

Re: nifi data flow handling data spikes in listent...

Re: NiFi 1.5 System Error: Invalid host header

Re: Deployment in Production NiFi cluster

Re: Deployment in Production NiFi cluster

Re: MergeContent Nifi - Using the Correlation Attr...

Re: MergeContent Nifi - Using the Correlation Attr...

Re: Merge Fileflow files based on time rather than...

Re: Can apache minifi dynamically transmit to diff...