About MattWho

MattWho · ‎06-13-2017

@forest lin NiFi at its core has no issues working with very large files. Often, when you run into OOM, it is because of what you are trying to do with those very large files after they are in NiFi. In the majority of cases, OOM can be avoided via dataflow design and tweaks to the heap size allocated to the NiFi JVM. The content of a FlowFile does not live in heap memory space, but the FlowFile attributes do (except when swapped out to disk in large queues). So avoid extracting large amounts of the content into FlowFile attributes, avoid trying to split very large files into large numbers of small FlowFiles using a single processor, avoid trying to merge a very large number of FlowFiles into a single FlowFile, etc. You can still do these types of things, but you may need to do them in two stages rather than one. For example, split large files by every 5000 lines first, and then split the 5000-line FlowFiles by every line (huge difference in heap usage). If you found this answer addressed your question, please mark it as accepted to close out this thread. Thanks, Matt
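To make the heap difference concrete, here is a rough back-of-the-envelope model in Python. It is not NiFi code, and it rests on a simplifying assumption: each splitting stage handles one input FlowFile at a time and has to track all of the FlowFiles it produces from that input until the pass completes. The function names and the 1,000,000-line file are made up for illustration.

```python
import math


def peak_single_stage(total_lines: int) -> int:
    """One processor emits every line as its own FlowFile in a single pass."""
    return total_lines


def peak_two_stage(total_lines: int, chunk_lines: int = 5000) -> int:
    """Split into chunk_lines-sized FlowFiles first, then split each chunk by line.

    Assumption: each stage only tracks the outputs of the one input it is
    currently splitting, so the peak is the larger of the two stages.
    """
    first_stage = math.ceil(total_lines / chunk_lines)  # chunks from the big file
    second_stage = min(chunk_lines, total_lines)        # lines from one chunk
    return max(first_stage, second_stage)


total = 1_000_000
print(peak_single_stage(total))  # 1000000
print(peak_two_stage(total))     # 5000
```

Under that assumption the single-stage split tracks a million FlowFiles (and their attributes) at once, while the two-stage split never tracks more than 5000.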

MattWho · ‎06-13-2017

@Oleksandr Solomko You can see where these files are queued via the "Summary" UI. Once the Summary UI opens, select the "CONNECTIONS" tab. You can sort on any column by clicking that column. Once you have found the row for your queued connection, click on the "view connection details" icon on the far right side of the row. This will pop open a new UI that shows the queue breakdown per node in the cluster. This will help you identify whether you are having a cluster-wide issue or whether it is localized to one specific node. If it is just one node with all this queued data, you could manually disconnect this node from your cluster. Then go directly to the URL for that disconnected node and see if you can empty the queue then. Check for ERROR or WARN logs specifically in that node's nifi-app.log, nifi-user.log, and nifi-bootstrap.log. Also, what OS and Java version are you running? Thanks, Matt

MattWho · ‎06-13-2017

@forest lin Backpressure is not used to control data rate in your dataflow. The intent of the backpressure setting on connections is to control the amount of allowed queued data. Both backpressure settings are "soft" limits. Once backpressure kicks in on a connection, the processor feeding that connection will no longer be allowed to run.

So in your case above, you have backpressure set to 5 Objects (FlowFiles) or 5 KB of content. Since your queue was empty, no backpressure was being applied when the 37.05 MB FlowFile arrived at your ConvertCSVToAvro processor, so that processor was allowed to run. That 1 FlowFile was processed and placed on the outbound connection. It is at that time that backpressure kicked in, because you exceeded one of your backpressure settings. The ConvertCSVToAvro processor will now be prevented from running until the queue drops below 5 FlowFiles and 5 KB of queued data again. If all your processors are processing FlowFiles rapidly, backpressure will be applied very sparsely.

Also keep in mind that, for efficiency, some processors work on batches of FlowFiles. For example, with a backpressure object threshold of 5 you may see a queue with more than 5 FlowFiles. The batch of FlowFiles is placed on an outbound queue. The processor that did the batch processing will then not be allowed to run again until that outbound connection drops below 5 FlowFiles again.

The ControlRate processor allows you to actually control the throughput of a dataflow. It does not slow the processing. The ControlRate processor will allow data to queue on its input side and, based on its configured settings, only allow x number of FlowFiles through over y amount of time. Let's say it is configured to let 5 KB of data through every 1 minute. If you feed it a 37 MB file, it does not transfer just pieces of that FlowFile. It will feed through the entire 37 MB FlowFile and then not allow another FlowFile through until the average data per 1 minute is 5 KB.

Because of how the above works, data could continue to queue in front of ControlRate. This is where backpressure settings become important to stop upstream processors from running. You can set backpressure all the way upstream to your data ingest processors so they stop accepting new FlowFiles. Thanks, Matt
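The "soft limit" behavior can be sketched with a small Python model. This is only an illustration of the logic described above, not NiFi's implementation; the `Connection` class, `try_run_processor` function, and the "at or above the threshold" comparison are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Hypothetical model of a connection with the two backpressure thresholds."""
    object_threshold: int
    size_threshold_bytes: int
    queued_sizes: List[int] = field(default_factory=list)

    def backpressure_engaged(self) -> bool:
        # Assumed rule: engaged once either threshold is reached.
        return (len(self.queued_sizes) >= self.object_threshold
                or sum(self.queued_sizes) >= self.size_threshold_bytes)

    def enqueue(self, size_bytes: int) -> None:
        # Soft limit: a FlowFile that is already being produced is never rejected.
        self.queued_sizes.append(size_bytes)


def try_run_processor(outbound: Connection, flowfile_size: int) -> bool:
    """The upstream processor is only checked *before* it runs."""
    if outbound.backpressure_engaged():
        return False
    outbound.enqueue(flowfile_size)
    return True


outbound = Connection(object_threshold=5, size_threshold_bytes=5 * 1024)
big_flowfile = int(37.05 * 1024 * 1024)

first_run = try_run_processor(outbound, big_flowfile)  # queue was empty, so it runs
second_run = try_run_processor(outbound, 100)          # now blocked by the 5 KB threshold
print(first_run, second_run)  # True False
```

The thresholds are only consulted before the processor runs, which is why a single 37.05 MB FlowFile can land on a connection configured for 5 KB.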

MattWho · ‎06-12-2017

@Justin R. Is this a NiFi cluster installation with multiple nodes running on the same host? If that is the case, whichever node manages to bind to the port first wins; all other nodes on the same host will report that the port is already in use. Matt
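The "first to bind wins" behavior is general operating system socket behavior rather than anything NiFi-specific. A minimal Python sketch, using two plain TCP sockets on the loopback address as stand-ins for two nodes on one host:

```python
import socket

# First "node": bind to any free port on localhost and start listening.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen()
port = first.getsockname()[1]

# Second "node": try to bind to the very same address and port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    second_bind_failed = False
except OSError as exc:
    # Typically reported as "Address already in use".
    second_bind_failed = True
    print(f"port {port}: {exc}")
finally:
    second.close()
    first.close()
```

Giving each node on the host its own port avoids the conflict.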

MattWho · ‎06-12-2017

@Ahmad Mehr When you start NiFi, the UI does not become available until the application has completed loading.

/bin/nifi.sh status

The above command simply shows that the application is running, but does not indicate the UI is available yet. To verify that NiFi has completed the startup process and the UI is now available, you will need to look in the nifi-app.log for the following lines:

2017-06-12 09:16:16,029 INFO [main] org.apache.nifi.web.server.JettyServer NiFi has started. The UI is available at the following URLs:
2017-06-12 09:16:16,029 INFO [main] org.apache.nifi.web.server.JettyServer http://<HOSTNAME>:8075/nifi
2017-06-12 09:16:16,031 INFO [main] org.apache.nifi.BootstrapListener Successfully initiated communication with Bootstrap
2017-06-12 09:16:16,031 INFO [main] org.apache.nifi.NiFi Controller initialization took 14617467433 nanoseconds.

Until you see these log lines, the UI will not be accessible. You can also run the following Linux command to see if "something" is listening on port 8075 yet:

netstat -ant|grep LISTEN|grep 8075

Thank you, Matt
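If you would rather script these two checks, a small Python sketch could look like the following. The helper names are made up for this example, the marker text is copied from the log lines quoted above, and the log path and host in the usage comment are assumptions about your install.

```python
import socket

# Marker text taken from the JettyServer log line quoted above.
STARTED_MARKER = "NiFi has started. The UI is available at the following URLs:"


def ui_reported_started(log_text: str) -> bool:
    """Return True if the startup marker appears on any line of the log text."""
    return any(STARTED_MARKER in line for line in log_text.splitlines())


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


# Example usage (path and host are assumptions, adjust for your install):
# with open("logs/nifi-app.log") as fh:
#     print(ui_reported_started(fh.read()))
# print(port_is_listening("localhost", 8075))
```

As with the netstat command, a successful connection only shows that "something" is listening on the port; the log marker is the stronger signal.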

MattWho · ‎06-09-2017

@Eric Lloyd Input and Output ports are designed to send or receive data from one level up. When an input or output port is added at the root canvas level, the "one level up" is another system outside of this NiFi. You will also notice that ports added to the root canvas are rendered a little differently. There is an open Apache Jira on this subject; feel free to add your comments and use case to it: https://issues.apache.org/jira/browse/NIFI-2933 The current feeling is that adding Remote input and output ports should be left to the system administrator. This is because, in a secured connection, the admin must add the connecting systems as new users and authorize them to access these ports. Users are not typically granted this level of access. Thanks, Matt

MattWho · ‎06-08-2017

@Daniel Frank If you use @Matt Clarke in your response, I do not get an email notification. I am not following how you use the filename and path of file (B) to parse a totally different file (C) from the filesystem. Have you looked at the FetchFile processor? It accepts a FlowFile as input and uses attributes set on the incoming FlowFile to specify what file to fetch and from where. So you could use GetFile to pick up file (B), then extract what you need from file (B) into attributes that FetchFile can use to get file (C). FetchFile will stream the content of file (C) into the FlowFile originally belonging to file (B); however, the resulting FlowFile will retain all the FlowFile attributes that already existed on FlowFile (B). Thanks, Matt If you found this answer addressed your question, please mark it as accepted to close out this thread in the community.
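The content-swap behavior described for FetchFile can be pictured with a toy Python model. This is not NiFi code: the `FlowFile` class, the `fetch_file` function, and the attribute names `target.path` and `target.filename` are all hypothetical and only illustrate that the content is replaced while the attributes are kept.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: attributes plus content."""
    attributes: Dict[str, str] = field(default_factory=dict)
    content: bytes = b""


def fetch_file(flowfile: FlowFile,
               path_attr: str = "target.path",
               name_attr: str = "target.filename") -> FlowFile:
    """Replace the content with the file named by the attributes; keep the attributes."""
    path = os.path.join(flowfile.attributes[path_attr], flowfile.attributes[name_attr])
    with open(path, "rb") as fh:
        new_content = fh.read()
    return FlowFile(attributes=dict(flowfile.attributes), content=new_content)


# File (C) on disk, and a FlowFile for file (B) carrying attributes that point at (C).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "c.txt"), "wb") as fh:
        fh.write(b"content of file C")
    flowfile_b = FlowFile(
        attributes={"filename": "b.txt", "target.path": tmp, "target.filename": "c.txt"},
        content=b"content of file B",
    )
    result = fetch_file(flowfile_b)

print(result.content)                 # b'content of file C'
print(result.attributes["filename"])  # b.txt
```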

MattWho · ‎06-08-2017

@Daniel Frank What format is your data in (text?)? Is all the information you need in the content of these files? The GetFile processor already writes a set of attributes on every FlowFile it creates. You could use the ExtractText processor to read the FlowFile content and extract bits of it into FlowFile attributes. Thanks, Matt
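The idea of pulling small bits of content into attributes with regular expressions can be sketched in Python. This is a simplified illustration and not how ExtractText itself is implemented; the function name, the attribute names, and the sample content are invented for the example, and only the first capture group of each pattern is kept.

```python
import re
from typing import Dict


def extract_to_attributes(content: str, patterns: Dict[str, str]) -> Dict[str, str]:
    """Map attribute name -> first capture group of its regex, when the regex matches."""
    attributes: Dict[str, str] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, content)
        if match:
            attributes[name] = match.group(1)
    return attributes


content = "customer_id=42\nregion=emea\npayload=...\n"
attrs = extract_to_attributes(
    content,
    {"customer.id": r"customer_id=(\d+)", "region": r"region=(\w+)"},
)
print(attrs)  # {'customer.id': '42', 'region': 'emea'}
```

Keep the extracted values small, since FlowFile attributes are held in heap memory.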

MattWho · ‎06-08-2017

@Anthony Murphy NiFi is designed to be resilient. It is designed to restore processors to their last known state on startup (that state may be enabled, disabled, started, or stopped). Are you sure these processors were not stopped before the abrupt shutdown/restart of the server occurred? This is odd, since you say it only happens occasionally. And I will be honest, this is the first time I have heard of this issue. Is it always the same processors that fail to start? Are the processors that fail to start configured to use any NiFi Controller Services? If so, are those Controller Services failing to start also? Check the nifi-app.log during startup to see if there were any logged ERROR or WARN messages related to these processors or Controller Services. Thanks, Matt

MattWho · ‎06-08-2017

@Mahmoud Shash There was a bug identified in the Controller Service UI of HDF 2.1.3. This bug affected users' ability to modify, enable, disable, and delete Controller Services. The HDF 2.1.3 release was pulled down. This bug was addressed in HDF 2.1.4. If you upgrade to HDF 2.1.4, you will be able to successfully access the Controller Services in the CS UI. Thanks, Matt

Last Visited	‎11-19-2025 08:50 AM

Member Since	‎07-30-2019 10:41 AM
Posts	3,391
Kudos received	1614

Cloudera Community

Re: How to achieve inheritence within Parameter Co...

Re: Cannot access the NiFi Registry from NiFi and ...

Re: using nifi as a kafka streaming- real-time str...

Re: using nifi as a kafka streaming- real-time str...

Re: Nifi Registry and LDAP

Re: ControlRate and BackPressure seems not work fo...

Re: Unable to clear Nifi Queue

Re: ControlRate and BackPressure seems not work fo...

Re: Why won't HandleHTTPRequest bind to a secure (...

Re: NiFi Web UI not working -Facing ERROR- ERR_CON...

Re: Cannot access input port in a process group?

Re: ,How can I enrich flow file attributes with da...

Re: ,How can I enrich flow file attributes with da...

Re: Nifi not starting a group of processors on res...

Re: I can't run putHiveQL