Member since
07-30-2019
2915
Posts
1444
Kudos Received
847
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 41 | 04-26-2024 06:40 AM |
| | 324 | 04-23-2024 05:56 AM |
| | 47 | 04-22-2024 06:13 AM |
| | 186 | 04-17-2024 11:30 AM |
| | 133 | 04-16-2024 05:36 AM |
04-26-2024
06:54 AM
@kelpye While I am not familiar with this specific processor, the shared exception states that NiFi is unable to validate the value configured in the "CopybookPath" property, so maybe the following suggestions will help:

1. The following string in your copybook path is a NiFi Parameter reference: #{hdfs.output.raw-input-files}. When NiFi enables this processor, it resolves that reference to the value configured for it in the referenced parameter context. If you replace the reference with the absolute local path instead of using a parameter, does the processor validate?

2. Also keep in mind that all components executing on the canvas are executed as the NiFi service user and not as the authenticated user who added those components to the canvas. So make sure that the NiFi service user is able to navigate the resolved path to your copybook.cbl file and has the proper OS permissions to read that file.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you, Matt
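To illustrate suggestion 1, here is a hypothetical before/after for the "CopybookPath" property value. The absolute path below is an example only and assumes the parameter resolves to the directory containing your copybook:

```
CopybookPath: #{hdfs.output.raw-input-files}/copybook.cbl    (parameter reference)
CopybookPath: /data/copybooks/copybook.cbl                   (absolute path, to test validation)
```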
04-26-2024
06:40 AM
@SAMSAL You can add additional attributes that you want indexed with provenance, which you could then use in your provenance searches. Take a look at the properties available for the Write Ahead Provenance Repository in nifi.properties. Since you want to be able to search on some FlowFile attribute, you would add it to "nifi.provenance.repository.indexed.attributes". Keep in mind that adding additional indexed attributes or fields will increase your provenance repository's disk usage. Added attributes or fields will start being indexed after a restart of your NiFi. NiFi cannot go back and reindex already-processed FlowFiles, but this should help you going forward.

Thank you, Matt
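A minimal nifi.properties sketch of the property mentioned above; the attribute names in the indexed list ("schema.name", "customer.id") are examples only:

```properties
# Write Ahead Provenance Repository
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
# Comma-separated list of FlowFile attributes to index for provenance searches
nifi.provenance.repository.indexed.attributes=schema.name, customer.id
```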
04-26-2024
06:28 AM
@AlexisRub Not sure how to answer that for you. Typically production users who have access to a corporately managed LDAP/AD would use that with their NiFi. This provides better security, as corporate can manage the addition of new users and the removal of users no longer with the organization. If you also set up the ldap-user-group-provider in NiFi's authorizers.xml along with setting the ldap-provider in the login-identity-providers.xml, you'll have a proper production setup.

Let's say a new person joins the company and is added to the AD. The ldap-user-group-provider (depending on filters) could automatically pull that new user identity into NiFi, allowing your NiFi admin to set up access policies for them easily. And with the ldap-provider, that user could then authenticate to your NiFi (successful authentication does not mean they would have authorized access).

Even better, this opens the ability to use ldap/AD managed groups for authorization. Let's say you have an AD group named nifiadmins. You could sync this group and its members to NiFi via the ldap-user-group-provider and set up local authorization policies using that group identity. So later some user is added to or removed from the AD "nifiadmins" group. When NiFi syncs with ldap/AD via the ldap-user-group-provider (default is every 30 mins), that user would be added or removed as a known member of that group and would gain or lose authorizations without needing any manual action within NiFi. This is the most common setup for production end users with established ldap/AD groups for different teams that will access NiFi. Different teams can then be authorized access to only specific process groups and actions.

I set up a local ldap which creates a bunch of fake users and groups that I can manage for testing purposes, but that is not something I would do in a production setup. I would leave the corporate management of users to those responsible for that access control.

Thank you, Matt
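A hedged sketch of what the ldap-user-group-provider entry in authorizers.xml might look like; the URL, DNs, object classes, and password are placeholders you would replace with your own ldap/AD values:

```xml
<!-- Illustrative authorizers.xml fragment; all values below are examples only -->
<userGroupProvider>
    <identifier>ldap-user-group-provider</identifier>
    <class>org.apache.nifi.ldap.tenants.LdapUserGroupProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=manager,dc=example,dc=org</property>
    <property name="Manager Password">password</property>
    <property name="Url">ldap://ldap.example.org:389</property>
    <property name="User Search Base">ou=users,dc=example,dc=org</property>
    <property name="User Object Class">person</property>
    <property name="Group Search Base">ou=groups,dc=example,dc=org</property>
    <property name="Group Object Class">groupOfNames</property>
    <!-- Default sync interval mentioned above -->
    <property name="Sync Interval">30 mins</property>
</userGroupProvider>
```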
04-26-2024
06:12 AM
@gregbowers You say "This allows for basic user management without LDAP or Kerberos.", but what method of user authentication are you suggesting be used? Users and groups that are added via the UI, and to which you apply various policies, are NOT users that are managed by NiFi for authentication. Those added users are for setting authorization policies only. Authentication must be handled by an authentication provider. The single-user-provider only supports a single user, not the multiple users @AlexisRub is looking to support. So what other provider are you suggesting be configured in the login identity providers? The only options that can be configured in the login-identity-providers.xml in Apache NiFi are single-user-provider, ldap-provider, and kerberos-provider. Are you suggesting some additional third-party custom provider?

Thank you, Matt
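For reference, a hedged sketch of an ldap-provider entry in login-identity-providers.xml, one of the three options named above; the URL, DNs, filter, and password are placeholders, not a working configuration:

```xml
<!-- Illustrative login-identity-providers.xml fragment; all values below are examples only -->
<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=manager,dc=example,dc=org</property>
    <property name="Manager Password">password</property>
    <property name="Url">ldap://ldap.example.org:389</property>
    <property name="User Search Base">ou=users,dc=example,dc=org</property>
    <property name="User Search Filter">uid={0}</property>
    <property name="Identity Strategy">USE_USERNAME</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>
```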
04-26-2024
06:02 AM
@s198 Back pressure thresholds are configured on NiFi connections between processors. There are two types of back pressure thresholds:

1. Object Threshold - Back pressure is applied once the number of FlowFiles reaches or exceeds the setting (default is 10,000 FlowFiles). Applied per node and not across all nodes in a NiFi cluster.
2. Size Threshold - Back pressure is applied once the total data size of queued FlowFiles reaches or exceeds the setting (default is 1 GB). Applied per node and not across all nodes in a NiFi cluster.

When back pressure is being applied on a connection, it prevents the immediate processor that feeds data into that connection from being scheduled to execute until the back pressure is no longer being applied. Since back pressure is a soft limit, this explains your two different scenarios:

1. 20 FlowFiles are transferred to the connection feeding your MergeContent processor. Initially that connection is empty, so no back pressure is applied. The preceding processor starts adding FlowFiles to that connection until the "Size Threshold" of 1 GB is reached, and back pressure is then applied, preventing the preceding processor from being scheduled and processing the remaining 6 files. The max bin age set on your MergeContent processor then forces the bin containing the first 14 FlowFiles to merge after 5 minutes, removing the back pressure, which allowed the next 6 files to be processed by the upstream processor.
2. The connection between the FetchHDFS and PutSFTP processors has no back pressure being applied (neither the object threshold nor the size threshold has been reached or exceeded), so the FetchHDFS is scheduled to execute. The execution resulted in a single FlowFile larger than the 1 GB size threshold, so back pressure would be applied as soon as that 100 GB file was queued. As soon as the PutSFTP successfully executed and moved the FlowFile to one of its downstream relationships, the FetchHDFS would have been allowed to get scheduled again.

There are also processors that execute on batches of files in a single execution. The list- and split-based processors like ListFile and SplitContent are good examples. It is possible that the ListFile processor performs a listing execution containing in excess of the 10,000 object threshold. Since no back pressure is being applied at that moment, the execution will be successful and will create all 10,000+ FlowFiles that get transferred to the downstream connection. Back pressure will then be applied until the number of FlowFiles drops back below the threshold. That means as soon as it drops to 9,999, back pressure would be lifted and the ListFile processor would be allowed to execute.

In your MergeContent example you made the proper edit to the object threshold to allow more FlowFiles to queue in the upstream connection to your MergeContent. If you left the downstream connection containing the "merged" relationship with the default size threshold, back pressure would have been applied as soon as the merged FlowFile was added to that connection, since its merged size exceeded the 1 GB default size threshold.

PRO TIP: You mentioned that your daily merge size may vary from 10 GB to 300 GB for your MergeContent. How to handle this most efficiently depends really on the number of FlowFiles and not so much on the size of the FlowFiles. The only thing to keep in mind with size thresholds is the content repository's disk limitations. The total disk usage by the content repository is not equal to the size of the actively queued FlowFiles on the canvas, because content is immutable once created and because of how NiFi stores FlowFile content in claims. NiFi holds FlowFile attributes/metadata in NiFi's heap memory for better performance (swapping thresholds exist to help prevent out-of-memory issues, but they impact performance when swapping is happening). NiFi sets the object threshold at 10,000 because swapping does not happen at that default size.

When merging very large numbers of FlowFiles, you can get better performance from two MergeContent processors in series instead of just one. To help you understand the above, I recommend reading the following two articles:
https://community.cloudera.com/t5/Community-Articles/Dissecting-the-NiFi-quot-connection-quot-Heap-usage-and/ta-p/248166
https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418

Thank you, Matt
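The two soft limits described above can be sketched as a tiny predicate. This is an illustration of the described behavior only (the function name and structure are mine, not NiFi's actual code):

```python
# Minimal sketch of NiFi's per-connection back pressure check, per node.
# Defaults mirror the description above: 10,000 FlowFiles or 1 GB queued.
def back_pressure_applied(queued_count: int, queued_bytes: int,
                          object_threshold: int = 10_000,
                          size_threshold_bytes: int = 1 << 30) -> bool:
    """Back pressure engages once either soft limit is reached or exceeded."""
    return queued_count >= object_threshold or queued_bytes >= size_threshold_bytes

# A single 100 GB FlowFile exceeds the 1 GB size threshold on its own:
print(back_pressure_applied(queued_count=1, queued_bytes=100 * (1 << 30)))  # True
# 9,999 small FlowFiles stay below the 10,000 object threshold:
print(back_pressure_applied(queued_count=9_999, queued_bytes=0))  # False
```

Because the check only runs when FlowFiles are added, a single execution (a 100 GB fetch, or a 10,000+ file listing) can overshoot the threshold; the limit is soft, exactly as in the scenarios above.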
04-25-2024
06:16 AM
1 Kudo
@AlexisRub NiFi never offered an embedded user-authentication management feature until the more recent single-user-provider. This provider was only introduced so that Apache NiFi could support an HTTPS out-of-the-box default setup. Over the years since Apache NiFi was open sourced, the community noticed unsecured instances (the previous out-of-box default) exposed on the internet, so a decision was made to change the out-of-the-box setup to be secured. A secured NiFi requires that all users/clients are both authenticated and authorized.

The single-user-provider was introduced to simplify access to a secured NiFi for evaluation purposes. This authentication provider, as you have noticed, does not support multiple users. The corresponding single-user-authorizer found in the authorizers.xml configuration also does not support multi-user authorization; it simply grants the single-user-provider user complete and full authorized access to everything in the NiFi. This provider also does not support NiFi clusters. For a multi-user environment or a clustered NiFi, a different method of external authentication and authorization must be used. Apache NiFi supports numerous user/client authentication methods beyond just single-user, LDAP, and Kerberos, listed in the User Authentication section of the admin guide.

Worth noting is that a secured NiFi requires a keystore and truststore, and NiFi will generate the keystore and truststore files with a self-signed clientAuth/serverAuth certificate if they do not already exist at startup. When NiFi is secured (HTTPS enabled and a valid keystore and truststore configured) and no additional authentication methods have been configured, user/client authentication is required through the TLS exchange. This means that when you try to access the NiFi UI via your browser, NiFi will respond to the browser (client) within the TLS exchange that a clientAuth certificate is "REQUIRED". If one is not provided, the connection is closed. When additional authentication methods are configured, NiFi will instead "WANT" a clientAuth certificate. If the browser does not present a client certificate, NiFi moves on to the next configured authentication method.

I wanted to point out the above since certificates are probably the next easiest way to set up multi-user authenticated access. This would require you to generate a unique clientAuth certificate for each unique user. These clientAuth certificates would either be self-signed or signed by some certificate authority. If self-signed, the public cert for each would need to be added to the NiFi truststore file. If signed by some authority, only that signing authority's trust chain would need to be added to NiFi's truststore. The unique users would then load their client certificates into their browsers so they could be presented in the mutual TLS exchange with your NiFi.

In order to authorize multiple users, you would need to stop using the default single-user-authorizer and instead use the StandardManagedAuthorizer. This authorization provider will allow you to define your initial admin user, who will be granted the minimum required admin authorizations. So initially this would be the only user authorized to access the NiFi UI. Once in, this initial admin user can define additional user and group identities directly from the NiFi UI, against which authorization policies can be defined. Granting the same policies granted to your initial admin user will establish a second admin user's authorizations. More information on the various policies and what they grant can be found in the Configuring Users & Access Policies section of the admin guide.

That being said, I typically set up OpenLDAP and use the ldap-provider for authentication. But this requires that you have somewhere to install it (perhaps on the same server as NiFi). The advantage here is that you do not need to mess with the NiFi truststore. You can also use this ldap server for multiple instances of NiFi and NiFi-Registry.

Thank you, Matt
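For reference, these are the nifi.properties entries that point a secured NiFi at its keystore and truststore; the paths and passwords below are placeholders only:

```properties
# Illustrative nifi.properties security section -- paths and passwords are examples only
nifi.security.keystore=./conf/keystore.p12
nifi.security.keystoreType=PKCS12
nifi.security.keystorePasswd=changeit
nifi.security.truststore=./conf/truststore.p12
nifi.security.truststoreType=PKCS12
nifi.security.truststorePasswd=changeit
```

Self-signed user certs (or a signing authority's trust chain) would be imported into the truststore file referenced here.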
04-23-2024
05:56 AM
@s198 The two most common scenarios for this type of failure are:

1. A file already exists with the same name when trying to rename. Typically resolved by using an UpdateAttribute processor on the failure path to modify the filename. Perhaps use the nextInt NiFi Expression Language function to add an incremental number to the filename, or, in your case, modify the time by adding a few milliseconds to it.
2. Some process is consuming the dot (.) filename before the PutSFTP processor has renamed it. This requires modifying the downstream process to ignore dot files.

While it is great that the run duration and run schedule increases appear to resolve this issue, I think you are dealing with a millisecond race condition, and those two options will not always guarantee success here. The best option is to programmatically deal with the failures via a filename attribute modification, or to change how you are uniquely naming your files if possible.

Thank you, Matt
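As a sketch of option 1, an UpdateAttribute processor on the failure path could rewrite the "filename" attribute using the nextInt() Expression Language function; the exact expression below is illustrative:

```
${filename:substringBeforeLast('.')}-${nextInt()}.${filename:substringAfterLast('.')}
```

This would turn, for example, a colliding "report.csv" into "report-1.csv" before retrying the PutSFTP.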
04-22-2024
12:11 PM
@s198
- Do you have the full stack trace from the nifi-app.log when the rename fails?
- Is it always the same exact stack trace?
- Have you tried putting this processor class in DEBUG via the NiFi logback.xml to see what additional logging it may produce when the exception occurs?

Thanks, Matt
04-22-2024
06:13 AM
@manishg Not sure what version of Apache NiFi you are using here. I would not recommend using the InferAvroSchema processor; depending on your use case there may be better options. Most record readers (like CSVReader) have the ability to infer schema.

From the output provided, you have a CSV file that is 44 bytes in size. According to the InferAvroSchema processor documentation:

When inferring from CSV data a "header definition" must be present either as the first line of the incoming data or the "header definition" must be explicitly set in the property "CSV Header Definition". A "header definition" is simply a single comma separated line defining the names of each column. The "header definition" is required in order to determine the names that should be given to each field in the resulting Avro definition.

Does your content here meet the requirements of the InferAvroSchema processor? Do you see the same issue if you try to infer the schema via the CSVReader controller service? These two components do not infer schema in the same way. The InferAvroSchema processor relies on the Kite SDK, which is not part of Apache NiFi itself and is no longer being maintained.

Thank you, Matt
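To illustrate the quoted requirement, here is a small hypothetical CSV whose first line is a valid "header definition" (the column names become the field names in the resulting Avro schema):

```
id,name,amount
1,alpha,10.5
2,beta,20.0
```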
04-18-2024
01:30 PM
1 Kudo
@s198 I think step one would be looking more into the failures. Are the failures always with the rename of the dot file? PutSFTP writes to a dot file (hidden file) and then, upon write completion, renames the file from .xyz to xyz. You also never shared your complete PutSFTP processor configuration.

1. Did you inspect the SFTP server log for any logging related to the failures you encountered?
2. What is being done with the files once placed on the SFTP server? Is there some other process consuming them from there?
3. Any chance that other process is consuming the dot files (hidden files) before NiFi has a chance to rename them?
4. Do any of the queued FlowFiles have the same "filename" attribute as another FlowFile or as a file already present on the target SFTP server? (This is a common issue where a file of the same name still exists on the target when the other is written as a dot file and the rename then fails. Then, on retry, some process has consumed the duplicate and the new file succeeds on rename.)

As far as options 3 and 4 go, both introduce some latency in your dataflow. With (3) the processor only gets scheduled once every 30 seconds, so FlowFiles will queue up between runs. The PutSFTP processor has a batch setting for how many FlowFiles get processed in one execution. If more FlowFiles are queued than that batch setting, the extras will sit until the next time the processor is scheduled. My concern is that the latency introduced by options 3 and 4 may simply be masking the actual issue that needs to be addressed. With (4) the processor gets scheduled as fast as possible, but when it executes, the thread remains active for 500 ms working on as many FlowFiles as possible in the single execution. Then at 500 ms it closes out that thread and (assuming a run schedule of 0) the processor would immediately be scheduled again. As far as which is better, it is about getting the best throughput with the least amount of latency. Data volumes, sizes, etc. come into play here.

I typically favor option 4 myself. But option 3 can still work for you with a much lower run schedule (30 secs is a lot of latency for a continuous flow).

Thank you, Matt