Member since: 07-30-2019
Posts: 3406
Kudos Received: 1622
Solutions: 1008
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 154 | 12-17-2025 05:55 AM |
| | 215 | 12-15-2025 01:29 PM |
| | 158 | 12-15-2025 06:50 AM |
| | 265 | 12-05-2025 08:25 AM |
| | 444 | 12-03-2025 10:21 AM |
03-24-2020
12:02 PM
@Faerballert The NiFi merge-based processors only offer the options "Keep Common Attributes" (keeps only attributes where every merged FlowFile has the same attribute with the same value) or "Keep All Unique Attributes" (same as above, but also keeps any attribute that is unique to a single FlowFile, or that has the same value on every FlowFile where it appears). There is no option to merge all attributes into a comma-separated list of unique values. What is the use case for such an attribute merge? There is no way to tell which value goes with which chunk of the merged data. Plus, if the merged FlowFile were later split, every produced split FlowFile would carry all the same FlowFile attributes. Hope this helps, Matt
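To make the difference between the two strategies concrete, here is a minimal Python sketch (not NiFi's actual implementation) of how the attribute maps of several FlowFiles could be combined under each option:

```python
def keep_common_attributes(flowfile_attrs):
    """Keep only attributes present on every FlowFile with the same value."""
    merged = dict(flowfile_attrs[0])
    for attrs in flowfile_attrs[1:]:
        merged = {k: v for k, v in merged.items() if attrs.get(k) == v}
    return merged

def keep_all_unique_attributes(flowfile_attrs):
    """Also keep attributes found on only some FlowFiles, as long as every
    observed value for that attribute is the same (no conflicts)."""
    seen = {}  # attribute name -> set of values observed across FlowFiles
    for attrs in flowfile_attrs:
        for k, v in attrs.items():
            seen.setdefault(k, set()).add(v)
    return {k: vals.pop() for k, vals in seen.items() if len(vals) == 1}

# Illustrative attribute maps for three FlowFiles being merged:
attrs = [{"schema": "v1", "host": "a"}, {"schema": "v1", "host": "b"}, {"schema": "v1"}]
print(keep_common_attributes(attrs))      # "schema" survives; "host" is not on all three
print(keep_all_unique_attributes(attrs))  # "schema" survives; "host" dropped (conflicting values)
```

Note that under either strategy an attribute with conflicting values is simply dropped, which is why no comma-separated merge exists: the association between value and data chunk is lost.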
03-24-2020
11:44 AM
@domR i) Do List processors w/ timestamp tracking store state locally?

If you are running a standalone NiFi and not a NiFi cluster, all state is stored locally on disk. If clustered, this depends on the list processor and how it is configured. The ListFile processor can be configured to store state locally or remotely depending on your use case. For example, if a ListFile is added to a NiFi cluster and every node is listing from a local path not shared across all nodes, you would want each node to store the ListFile state locally, since the state is unique per node and other nodes have no access to the directory on each node. If your ListFile is listing against a directory that is mounted on every node in the cluster, it should be configured for remote (cluster) state and scheduled to run on the primary node only. Other list-based processors store state locally only when NiFi is standalone; clustered NiFi installs store that state in ZooKeeper.

ii) Does this state survive NiFi restarts?

Yes. Local state is stored on disk in NiFi's local state directory, and cluster/remote state is stored in ZooKeeper. State providers are configured via the state-management.xml configuration file.

iii) If running on primary node only, would this mean when another primary node is chosen, the List processor would list any files it hasn't tracked (and re-ingress a large backlog of files if still there)?

When a primary node change occurs, the primary-node-only processors on the previous primary node are asked to stop executing, and the same processors on the newly elected primary node are asked to start. On the new node, the processor retrieves the last known state stored in ZooKeeper for that component before executing. There is a small chance of limited data duplication: asking processors on the old primary node to stop does not kill active threads, so if a processor is in the middle of execution and does not complete (i.e., update cluster state in ZooKeeper) before the newly elected primary node pulls cluster state and starts executing, some files may be listed again by the new node. It will not, however, list from the beginning.

iv) What out-of-box solutions can help to get around the issue of non-persisted non-distributed listing, or do we need custom auditing triggering individual listings?

NiFi does persist state through node restarts. Note: you can right-click on a processor that stores state and select "View state" to see what has been stored. You can also right-click on a processor and select "View usage" to open the embedded documentation for that component. The embedded documentation contains a "State Management:" section that tells you whether the component stores state and whether that state is stored locally or in the cluster (ZooKeeper).

Hope this helps, Matt
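For reference, here is a sketch of the two providers in conf/state-management.xml (the connect string and values below are illustrative placeholders; check the file in your own install for the exact ids and properties):

```xml
<stateManagement>
    <!-- Local state: used by standalone NiFi, and for per-node local state in a cluster -->
    <local-provider>
        <id>local-provider</id>
        <class>org.apache.nifi.controller.state.providers.local.WriteAheadLocalStateProvider</class>
        <property name="Directory">./state/local</property>
    </local-provider>
    <!-- Cluster state: stored in ZooKeeper and shared by all nodes -->
    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">zk-host1:2181,zk-host2:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>
</stateManagement>
```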
03-24-2020
11:10 AM
@Alexandros Going to ask the simple question first... There are FlowFiles traversing the processors in this newly instantiated flow from your template, correct?

The next thought would be around authorizations (assuming your NiFi is secured):
1. Is the user running the provenance query authorized to "view provenance" and "view the data" on the components? If these policies are set on the process group containing the processor components and not on the components themselves, the components will inherit the policies from the process group.
2. Is this a NiFi cluster? If so, make sure your NiFi nodes are also authorized to "view provenance" and "view the data". When you authenticate to NiFi and run a provenance query, that query is replicated to all nodes in your cluster, and the results are returned to the node on which the originating request was made. If that node is not authorized to view data returned from other nodes, that data will not be displayed.

Then we need to make sure provenance is still working. While you are seeing provenance events displayed for your existing flow, are those results recent? If you monitor the contents of your provenance_repository, do you see the timestamps on the <num>.prov files updating? We need to make sure provenance has not stopped working for some reason. Also make sure you are using the WriteAheadProvenanceRepository implementation (the default in 1.11) and not the PersistentProvenanceRepository implementation (configured in the nifi.properties file).

Hope this helps, Matt
03-24-2020
08:15 AM
@Umakanth The "Unable to find valid certificate path to requested target" error suggests an incomplete certificate trust chain. I would suggest getting a verbose listing of your NiFi's truststore.jks and keystore.jks files:

keytool -v -list -keystore <truststore.jks or keystore.jks>

If this is a NiFi cluster, make sure you check the truststore.jks on all your NiFi nodes; they need to be able to successfully trust each other as well. When you access the UI via any node's URL, that request is sent to the cluster coordinator and replicated to all nodes in your cluster, and that node-to-node communication also uses mutual TLS authentication.

A complete trust chain includes all Certificate Authorities (CAs), from any intermediate(s) all the way up to the root CA. When you look at the verbose output for your keystore.jks (it must contain only one PrivateKeyEntry, supporting both clientAuth and serverAuth), you will see an owner and an issuer for your PrivateKeyEntry. That issuer is the signer of your certificate. The signer may be an intermediate CA, meaning it is not both the issuer and the owner of its own certificate. If that CA has a different issuer, then it was signed by another CA, and your truststore.jks must include that CA as well. This process repeats until you reach the root CA, which has the same DN for both owner and issuer. Once all those TrustedCertEntries exist in your truststore, you have the complete trust chain.

The same holds true for the client certificate you are using to connect to your NiFi cluster. I noticed it is signed/issued by "localhost". Make sure the "localhost" TrustedCertEntry also exists in every NiFi node's truststore.jks, and if the localhost CA was itself signed by another CA, make sure you have traced that all the way back to the root CA as well.

Another thing you may want to try, to determine whether it is a node-to-node SSL issue or a client-to-node issue, is to start only one node in your NiFi cluster and try to access it after it is fully up.

Hope this helps, Matt
03-16-2020
05:09 PM
1 Kudo
@Gubbi Depending on which processor is used to create your FlowFile from your source Linux directory, you will likely have an "absolute.path" FlowFile attribute on the FlowFile:

absolute.path = /users/abc/20200312/gtry/

You can pass that FlowFile to an UpdateAttribute processor, which can use NiFi Expression Language (EL) to extract the date from that absolute path into a new FlowFile attribute. Add a new property (the property name becomes the new FlowFile attribute):

Property: pathDate
Value: ${absolute.path:getDelimitedField('4','/')}

The resulting FlowFile will have a new attribute:

pathDate = 20200312

Now you can use that FlowFile attribute later when writing to your target directory in S3. I assume you would use the PutS3Object processor for this? If so, you can configure the "Object Key" property with the following:

/Users/datastore/${pathDate}/$(unknown)

NiFi EL will replace ${pathDate} with "20200312" and $(unknown) will be replaced with "gyyy.csv".

Hope this helps, Matt
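As a sanity check of what getDelimitedField('4','/') returns, here is the equivalent extraction in plain Python (NiFi EL fields are 1-indexed, so field 4 of this path is the date; the filename value below is just the example from this thread):

```python
absolute_path = "/users/abc/20200312/gtry/"

# Splitting on "/" yields ['', 'users', 'abc', '20200312', 'gtry', ''];
# NiFi EL's field 4 corresponds to Python index 3 (EL is 1-indexed).
path_date = absolute_path.split("/")[3]
print(path_date)  # 20200312

# Build the S3 object key the same way the PutS3Object property would,
# assuming the FlowFile's filename is "gyyy.csv":
filename = "gyyy.csv"
object_key = f"/Users/datastore/{path_date}/{filename}"
print(object_key)  # /Users/datastore/20200312/gyyy.csv
```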
03-16-2020
04:45 PM
@saivenkatg55 Sorry to hear you are having space issues with your content repository. The most common cause is active FlowFiles still referencing content claims. Since a content claim cannot be moved to an archive sub-directory or deleted until no FlowFiles reference that claim, even a small FlowFile still queued somewhere within a dataflow can prevent a large claim from being removed.

I recommend using the NiFi Summary UI (Global menu --> Summary) to locate connections with FlowFiles just sitting in them, not getting processed. Look at the connections tab and click on "Queue" to sort connections by queued FlowFiles. A connection with queued FlowFiles but 0 for both "In/Size" and "Out/Size" is what I would be looking for; it indicates that the number of queued FlowFiles has not changed in the last 5 minutes. You can use the go-to arrow at the far right to jump to that connection on the canvas. If that data is not needed (just left over in some inactive dataflow), right-click on the connection to empty the queue. See if the content repo usage drops after clearing some queues.

It is also possible that not enough file handles exist for your NiFi service user, preventing clean-up from working efficiently. I recommend increasing the open files limit and process limits for your NiFi service user. Also check whether your flowfile_repository is large, or whether content claims have been moved to archive sub-directories but not yet purged. Does a restart of NiFi, which releases file handles, trigger some clean-up of the repo(s) on startup?

It is also dangerous to have all your NiFi repos co-located on the same disk, because corruption of your flowfile repository can lead to data loss. The flowfile_repository should always be on its own disk, the content_repository on its own disk, and the provenance_repository on its own disk. The database repository can exist on a disk used for other NiFi files (config files, local state, etc.). https://community.cloudera.com/t5/Community-Articles/HDF-NIFI-Best-practices-for-setting-up-a-high-performance/ta-p/244999

Here are some additional articles that may help you:
https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
https://community.cloudera.com/t5/Community-Articles/How-to-determine-which-FlowFiles-are-associated-to-the-same/ta-p/249185

Hope this helps, Matt
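As an example of raising those limits, assuming the NiFi service runs as a user named "nifi", an entry in /etc/security/limits.conf might look like this (the values are illustrative; size them for your environment, and note the service must be restarted under a fresh session for them to take effect):

```
# /etc/security/limits.conf -- illustrative limits for a "nifi" service user
nifi  soft  nofile  50000
nifi  hard  nofile  50000
nifi  soft  nproc   10000
nifi  hard  nproc   10000
```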
03-06-2020
12:26 PM
@sfishman You are correct that NiFi does not support multiple private keys within a keyring. I encourage you to create an Apache NiFi Jira with your details and this enhancement request. https://issues.apache.org/jira/browse/NIFI Matt
03-06-2020
11:08 AM
@sfishman The EncryptContent processor supports encryption/decryption using the PGP encryption Algorithm. PGP requires that the relevant PGP properties have been configured. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.11.3/org.apache.nifi.processors.standard.EncryptContent/index.html Hope this helps, Matt
03-06-2020
11:01 AM
1 Kudo
@sfishman Currently FTPS is not supported. There is an existing open Jira to add SSL capability to the get and put FTP processors here: https://issues.apache.org/jira/browse/NIFI-2278 As an alternative, you may consider using the PutFile processor to write your file to disk and then using the ExecuteStreamCommand processor to invoke an FTPS-capable command-line client to send that file to your FTPS server. This way you can still accomplish the transfer through an automated NiFi dataflow. Hope this helps, Matt
03-06-2020
10:41 AM
@vikrant_kumar24 You would not configure your Python script to write an XML file to disk; NiFi handles FlowFile creation in the framework. Any data your Python script passes to STDOUT will be populated into the content of the resulting FlowFile routed to the "output stream" relationship of the ExecuteStreamCommand processor. Your script does not need any awareness of what a FlowFile is or how it is created. So simply have your Python script send the XML content to STDOUT, and NiFi will take care of putting that content into the FlowFile that is produced and routed to the "output stream" relationship of the processor. You can then use the UpdateAttribute processor to change the filename associated with that content. Hope this helps, Matt
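As a minimal sketch, a hypothetical script like this is all ExecuteStreamCommand needs; whatever the script writes to STDOUT becomes the content of the FlowFile (the element names here are made up for illustration):

```python
#!/usr/bin/env python3
# build_xml.py -- hypothetical example for use with ExecuteStreamCommand.
# The script only writes XML to STDOUT; NiFi captures STDOUT as the content
# of the resulting FlowFile, so no file I/O is needed.
import sys
import xml.etree.ElementTree as ET

def build_xml() -> str:
    """Build a small XML document and return it as a string."""
    root = ET.Element("records")
    rec = ET.SubElement(root, "record", id="1")
    ET.SubElement(rec, "value").text = "hello"
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # NiFi reads this STDOUT stream into the outgoing FlowFile's content.
    sys.stdout.write(build_xml())
```

In the ExecuteStreamCommand processor you would point "Command Path" at your Python interpreter and pass the script via "Command Arguments"; the exact values depend on your environment.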