Member since: 07-30-2019
Posts: 3469
Kudos Received: 1641
Solutions: 1018
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 183 | 05-06-2026 09:16 AM |
| | 295 | 05-04-2026 05:20 AM |
| | 249 | 05-01-2026 10:15 AM |
| | 470 | 03-23-2026 05:44 AM |
| | 355 | 02-18-2026 09:59 AM |
03-03-2021
06:28 AM
1 Kudo
@Pavitran This does present a challenge. Typically the ListFile processor is used to list files from a local file system. It is designed to record state (by default based on last-modified timestamp) so that only newer files are listed, but the first run lists all files by default. Also, looking at your example, your latest directory does not correspond to the current day.

ListFile does not actually consume content; it generates a 0-byte FlowFile for each file listed, along with attributes/metadata about the source file. The FetchFile processor is then used to fetch the actual content. With large listings this lets you redistribute the 0-byte FlowFiles across all nodes in your cluster before consuming the content (provided the same local file system is mounted on all nodes; if each node sees different files, do not load balance between the processors).

So you could make a first run that lists everything and simply delete those 0-byte FlowFiles. That establishes state, and from that point on ListFile will only list the newest files.
Pros:
1. State allows this processor to be unaffected by outages; the processor will still consume all non-previously-listed files after an outage.
Cons:
1. The initial run creates potentially a lot of 0-byte FlowFiles to get rid of in order to establish state.
2. After an extended outage, on restart the flow may consume more than just the latest files, since it will consume all files with timestamps newer than the timestamp last stored in state.

Other options:
A: The ListFile processor has an optional "Maximum File Age" property, which limits the listing to files no older than the configured amount of time.
Pros of setting this property:
1. Reduces or eliminates the massive listing on first run.
Cons of setting this property:
1. Under an extended outage that exceeds the configured "Maximum File Age", a file you wanted listed may be skipped.
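ListFile's timestamp-based state behavior described above can be sketched in shell. This is a hypothetical illustration (the `list_newer` helper and state-file idea are not NiFi code), assuming a flat directory:

```shell
# Sketch of ListFile's state behavior: the first run emits every file;
# later runs emit only files modified after the previous run's timestamp,
# which is recorded here as the mtime of a state file.
list_newer() {
  local dir="$1" state="$2"
  if [ -f "$state" ]; then
    find "$dir" -maxdepth 1 -type f -newer "$state"
  else
    find "$dir" -maxdepth 1 -type f   # first run: everything is listed
  fi
  touch "$state"                      # record "now" as the new state
}
```

This also illustrates the outage trade-off above: if the state file is stale after an extended outage, every file newer than it gets listed on the next run.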
B: Since FetchFile uses attributes/metadata to fetch the actual content, you could craft the source FlowFile yourself and send it to the FetchFile processor. For example, use an ExecuteStreamCommand processor to execute a bash script on disk that lists the files only from the latest directory. Then use UpdateAttribute to add the other attributes FetchFile requires to get the actual content, and use SplitText to split that listing of files into individual FlowFiles before the FetchFile processor.
Pros:
1. You are in control of what is being listed.
Cons:
1. Depending on how often a new directory is created and how often you run your ExecuteStreamCommand processor, you may end up listing the same source files again, since ExecuteStreamCommand has no state option. You may be able to handle this with a DetectDuplicate processor in your flow design.
2. If the listed directory has a new file added to it after a previous listing by ExecuteStreamCommand, the next run will list all previous files again along with the new ones for that directory. Again, this might be handled with a DetectDuplicate processor.

Hope this helps give you some ideas,
Matt
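A minimal sketch of the bash listing for option B, assuming dated subdirectory names that sort lexicographically (the `latest_files` helper and the directory layout are hypothetical):

```shell
# Print the files in the lexicographically newest subdirectory of $1,
# one absolute path per line. ExecuteStreamCommand would emit this output
# as FlowFile content for splitting and attribute enrichment downstream.
latest_files() {
  local base="$1"
  local latest
  latest=$(find "$base" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1)
  find "$latest" -maxdepth 1 -type f | sort
}
```

The lexicographic sort only picks the newest directory if names are date-ordered (e.g. YYYY-MM-DD); otherwise sort on mtime instead.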
02-11-2021
07:57 AM
@adhishankarit When moving on to a new issue, I recommend always starting a new query for better visibility (for example, someone else in the community may have more experience with the new issue than me).

As far as your new query: your screenshots do not show any stats on the processor, so it is hard to get an idea of what we are talking about in terms of performance. How many fragments are being merged? How large is each of these fragments?

NiFi nodes are only aware of, and only have access to, the FlowFiles on that individual node. So if node "a" is "out" (not sure what that means), any FlowFiles still on node "a" that belong to the same fragment set will not yet have been transferred to node b or c to get binned for merge. A bin cannot be merged until all fragments are present on the same node. Since you mention that the bin eventually merges after 10 minutes, that tells me all fragments do eventually make it onto the same node. I suggest the first thing to address here is the space issue on your nodes.

Also keep in mind that while you have noticed node "a" has always been your elected primary node, there is no guarantee that will always be the case. A new Cluster Coordinator and Primary Node can be elected by ZooKeeper at any time. If you shut down or disconnect the currently elected primary node "a", you should see another node get elected as primary node. Adding node "a" back in will not force ZooKeeper to elect it as primary node again. So don't build your flow around a dependency on any specific node always being the primary node.

Matt
02-10-2021
10:27 AM
1 Kudo
@bsivalingam83 The ability to "ignore" properties in various NiFi config files was added with the CFM 1.0.1 release. With older CFM versions (1.0.0) you can set a safety valve to overwrite the currently set java.arg.13 value with something else. The safety valve simply defines a key=value pair that will not be used by the NiFi bootstrap. The end result: NiFi no longer uses G1GC and instead uses the default garbage collector for the version of Java being used.

Hope this helps,
Matt
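As a sketch, the safety valve entry in bootstrap.conf might look like the following. The replacement value is a hypothetical no-op system property, and the arg number must match whichever `java.arg.N` carries the G1GC flag in your release:

```
# Overwrites the G1GC argument with a harmless property the JVM ignores
java.arg.13=-Dignore.gc.arg=true
```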
02-10-2021
09:03 AM
3 Kudos
@Jarinek The process really depends on what update you are trying to make.
1. You cannot remove a connection that has queued FlowFiles in it, but you can redirect a connection with queued data to a different target processor.
2. You cannot redirect a connection if the processor it is currently attached to still has a running thread. Stopping a processor does not kill threads; it simply tells the processor not to execute again at the configured run schedule. Existing threads will continue to run until they complete. Until all threads exit, the processor is still in a state of "stopping" even though the UI reflects a red square for "stopped".
3. You cannot modify a processor if it still has running threads (see the note about "stopping" processors above).
4. If you stop the component on the receiving side of a connection, any FlowFiles queued on that connection that are not tied to an active thread still running on the target processor will not be processed and will remain queued on the connection. You can manually empty a queue through a REST API call (which means data loss), but that is not necessary if you are not deleting the connection.

Attempts to perform configuration changes while components still have active threads or are in a running state will result in an exception being thrown and the change not happening. Attempts to remove connections that have queued FlowFiles will throw an exception and block the removal.

Now, if all you are trying to do is modify some configuration on a processor, all you need to do is stop the processor, check that it has no active threads, make the config change, and then start the processor again.

I am not sure what you are asking with "update the flow ignoring any data in failure or error connection queues". NiFi does not ignore queued FlowFiles. It is also not wise to leave connections with queued FlowFiles just sitting around your dataflows. Those old queued FlowFiles will prevent removal of the content claims that contain their data. Since a content claim can contain the data from one to many FlowFiles, this can result in your content repository filling up. NiFi can only remove content claims that no FlowFiles point to anymore.

Here are some useful links:
https://nipyapi.readthedocs.io/en/latest/nipyapi-docs/nipyapi.html
https://github.com/Chaffelson/nipyapi
http://nifi.apache.org/docs/nifi-docs/rest-api/index.html
https://community.cloudera.com/t5/Community-Articles/Update-NiFi-Connection-Destination-via-REST-API/ta-p/244211
https://community.cloudera.com/t5/Community-Articles/Change-NiFi-Flow-Using-Rest-API-Part-1/ta-p/244631

Hope this helps,
Matt
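The "empty a queue via the REST API" step can be sketched as follows. The host and connection UUID are placeholders, and on a secured cluster the request would also need an authorization token:

```shell
# Build the drop-request URL used to empty a connection's queue.
# POSTing to it creates a drop request; note this means data loss.
drop_request_url() {
  local host="$1" connection_id="$2"
  printf 'https://%s/nifi-api/flowfile-queues/%s/drop-requests' \
    "$host" "$connection_id"
}

# Example (placeholder values, do not run as-is):
# curl -X POST "$(drop_request_url nifi-node1:8443 <connection-uuid>)"
```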
02-09-2021
05:52 AM
@medloh That is the correct solution here; the filename is always stored in a FlowFile attribute named "filename". Using the UpdateAttribute processor is the easiest way to manipulate that FlowFile attribute. You can use other attributes, static text, and even subjectless Expression Language functions like "now()" or "nextInt()" to create dynamic filenames for each FlowFile. https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Hope this helps,
Matt
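For example, an UpdateAttribute property named `filename` could combine the existing attribute, static text, and subjectless functions like this (the exact attribute names and suffix are illustrative):

```
${filename:substringBeforeLast('.')}-${now():format('yyyyMMdd-HHmmss')}-${nextInt()}.csv
```

Each FlowFile then gets a unique, timestamped filename derived from its original one.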
02-09-2021
05:48 AM
@Umakanth The GetSFTP processor actually creates a verbose listing of all files on the target SFTP server that it will be getting, and then fetches all those files. Unlike the ListSFTP processor, GetSFTP is an older, deprecated processor that does not store state. My guess is that at times the listing is larger than at other times, or, as you mentioned, occasional latency results in enough time passing between creating that list and actually consuming the files that the source system has moved a listed file before it is grabbed.

In that case, moving to the newer ListSFTP and FetchSFTP processors will help handle that scenario. The listing will list all the files it sees, and FetchSFTP will fetch the content of those that have not yet been moved by the source system. FetchSFTP will still throw an exception for each file it cannot find and route those FlowFiles to the not.found relationship, which you can handle programmatically in your NiFi dataflow(s).

Thanks,
Matt
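The list-then-fetch race described above can be sketched as a local-filesystem analogy. The `fetch_listed` helper and its "not.found" routing are illustrative only, not NiFi code:

```shell
# Fetch each previously listed file; files the source system has already
# moved are routed to a "not.found" handler instead of failing the run.
fetch_listed() {
  local dest="$1"; shift
  local f
  for f in "$@"; do
    if [ -f "$f" ]; then
      cp "$f" "$dest/"         # success: content fetched
    else
      echo "not.found: $f"     # handle downstream in the dataflow
    fi
  done
}
```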
02-08-2021
08:45 AM
@medloh The schema only needs to be defined in the RecordReader configured in the PutParquet processor. The ConvertRecord processor has both a Record Reader and a Record Writer; there, the Record Writer can inherit the schema from the Record Reader or define its own schema.

Hope this helps,
Matt
02-08-2021
08:20 AM
@Jarinek NiFi Variables can only be used by component properties that support NiFi's Expression Language (EL). NiFi Parameters can be used in ANY component property, including sensitive (encrypted) ones. This gives more flexibility to users, especially those who use NiFi Registry to promote version-controlled process groups across multiple NiFi instances/clusters. It is often the case that different environments use different URLs and passwords within the same dataflows. A dataflow can thus be promoted to another environment that simply uses different Parameter values, so the user does not have to update a large number of components each time a new version of the flow is promoted from one environment to another.

You are correct that Parameters are similar to Variables with respect to assignment to a process group: you can only have one Parameter Context assigned to a Process Group.

Hope this helps,
Matt
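For reference, the two reference syntaxes differ (the property names here are illustrative): variables use the Expression Language form and only work in EL-supporting properties, while parameters use the `#{...}` form and work in any property, including sensitive ones:

```
Remote URL:  ${sftp.hostname}    (variable; EL-supporting properties only)
Password:    #{sftp.password}    (parameter; any property, incl. sensitive)
```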
02-05-2021
08:09 AM
1 Kudo
@medloh The article you are using for reference is old and a bit out of date. As part of the work that went into NIFI-3921, the schema properties within the PutParquet processor were removed. Before those changes, at the time of the article you referenced, you had to set the schema properties on the processor, and they had to match the schema properties set in the RecordReader. With the changes, the processor simply gets them from the reader, so they do not need to be configured a second time in the processor properties.

Also, at the time of that article there were no ParquetReader or ParquetRecordSetWriter controller services. Now that NiFi has a Parquet reader and writer, you can use the ConvertRecord processor to read a source FlowFile and convert it to Parquet within your dataflow, and then have the freedom to use whatever processor you want downstream to write out the Parquet content. You can think of PutParquet as a combination of ParquetRecordSetWriter and PutHDFS, with only the RecordReader being selectable.

Hope this helps,
Matt
02-02-2021
07:12 AM
@BhaveshP I am in complete agreement with @tusharkathpal's response. But you should be able to work around this issue through a configuration change in your nifi.properties file:

nifi.web.proxy.host=dev.example.com:<port number>

Property description: A comma-separated list of allowed HTTP Host header values to consider when NiFi is running securely and will be receiving requests to a different host[:port] than the one it is bound to. For example, when running in a Docker container or behind a proxy (e.g. localhost:18443, proxyhost:443). By default, this value is blank, meaning NiFi only allows requests sent to the host[:port] that NiFi is bound to.

Since the hostname your client is using does not match any SAN in the individual nodes' certificates, the above property allows NiFi to accept this additional hostname. The other option is to create new certificates for each of your NiFi nodes where "dev.example.com" is added as an additional SAN entry.

Hope this helps,
Matt
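For the certificate route, a quick way to check whether a node's certificate already carries the extra SAN entry (the hostname is a placeholder; `-ext` requires OpenSSL 1.1.1+):

```shell
# Print the subjectAltName entries of a PEM certificate so you can check
# whether the proxy hostname (e.g. dev.example.com) is already present.
cert_sans() {
  openssl x509 -in "$1" -noout -ext subjectAltName
}
```

For a live node you could feed it the certificate retrieved with `openssl s_client -connect <host>:<port>` instead of a local PEM file.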