Member since 07-30-2019 | 3466 Posts | 1641 Kudos Received | 1015 Solutions
05-26-2017
01:30 PM
44 Kudos
HDF or CFM best practices guide to configuring your system and NiFi for high-performance dataflows.

Note: The recommendations outlined in this article are for the NiFi service and apply whether the NiFi service is being deployed/managed via Ambari, Cloudera Manager, or neither.

NiFi is pre-configured to run with very minimal configuration needed out of the box. Simply edit the nifi.properties file located in the conf directory by adding a sensitive props key (used to encode all sensitive properties in the flow) and you are ready to go. This may get you up and running, but that basic configuration is far from ideal for high-volume / high-performance dataflows. While the NiFi core itself does not have much impact on the system in terms of memory, disk/network I/O, or CPU, the dataflows that you build have the potential of impacting those performance metrics quite a bit. Some NiFi processors can be CPU, I/O, or memory intensive; in fact, some can be intensive in all three areas. We have tried to identify whether a processor is intensive in any of these areas within the processor documentation (processor documentation can be found by clicking Help in the upper right corner of the NiFi UI or by right-clicking a processor instance and selecting Usage).

This guide is not intended to cover how to optimize your dataflow design or how to fine-tune your individual processors. Instead, we will focus on the initial setup of the application itself and on the areas where you can improve performance by changing some out-of-the-box default values. We will take a closer look at many properties in both the nifi.properties and bootstrap.conf files, point out the properties that should be changed from their default values, and explain what the changes will gain you. For the properties that we do not cover in this guide, please refer to the Admin Guide (this can be found by clicking Help in the upper right corner of your NiFi UI or on the Apache website at https://nifi.apache.org/docs.html) for more information.

Before we dive into these two configuration files, let's take a 5,000-foot view of the NiFi application. NiFi is a Java application and therefore runs inside a JVM, but that is not the entire story. The application has numerous repositories that each have their own very specific function. The database repository keeps track of all changes made to the dataflow using the UI and of all users who have authenticated to the UI (only when NiFi is set up securely using https). The FlowFile repository holds the attributes of every FlowFile being processed within the dataflows built using the NiFi UI. This is one of the most important repositories: if it becomes corrupt or runs out of disk space, the state of FlowFiles can be lost (this guide will address ways to help avoid this from occurring). This repository holds information such as where a FlowFile currently is in the dataflow, a pointer to the location of its content in the content repository, and any FlowFile attributes associated with or derived from the content. It is this repository that allows NiFi to pick up where it left off in the event of a restart of the application (think unexpected power outage, inadvertent user/software restart, or upgrade). The content repository is where the actual content of every file being processed by NiFi resides. It is also where multiple historical versions of the content are held if data archiving has been enabled.
This repository can be very I/O intensive depending on the type of dataflows the user constructs. Finally, we have the provenance repository, which keeps track of the lineage of both current and past FlowFiles. Through the provenance UI, data can be downloaded, replayed, tracked, and evaluated at numerous points along the dataflow path. Download and replay are only possible if a copy of the content still exists in the content repository. Like any application, overall performance is governed by the performance of the individual components, so we will focus on getting the most out of these components in this guide.

nifi.properties file:

We will start by looking at the nifi.properties file located in the conf directory of the NiFi installation. The file is broken up into sections (Core Properties, State Management, H2 Settings, FlowFile Repository, Content Repository, Provenance Repository, Component Status Repository, Site to Site Properties, Web Properties, Security Properties, and Cluster Properties). The various properties that make up each of these sections come pre-configured with default values.

# Core Properties #

There is only one property in this section that has an impact on NiFi performance:

nifi.bored.yield.duration=10 millis

This property is designed to help with CPU utilization by preventing processors that use the timer-driven scheduling strategy from using excessive CPU when there is no work to do. The default 10-millisecond value already makes a huge impact on cutting down CPU utilization. Smaller values equate to lower latency but higher CPU utilization, so depending on how important latency is to your overall dataflow, increasing the value here will cut down overall CPU utilization even further.

There is another property in this section that can be changed. It does not have an impact on NiFi performance, but it can have an impact on browser performance:

nifi.ui.autorefresh.interval=30 sec

This property sets the interval at which the latest statistics, bulletins, and flow revisions are pushed to connected browser sessions. In order to reload the complete dataflow, the user must trigger a refresh. Decreasing the time between refreshes will allow bulletins to present themselves to the user in a timelier manner; however, doing so will increase the network bandwidth used, and the number of concurrent users accessing the UI compounds this. We suggest keeping the default value and only changing it if closer to real-time bulletin or statistics reporting in the UI is needed. The user can always manually trigger a refresh at any time by right-clicking on any open space on the graph and selecting "refresh status".

# H2 Settings

There are two H2 databases used by NiFi: a user DB (keeps track of user logins when NiFi is secured) and a history DB (keeps track of all changes made on the graph). Both stay relatively small and require very little hard drive space. The default installation path of <root-level-nifi-dir>/database_repository would result in the directory being created at the root level of your NiFi installation (same level as the conf, bin, lib, etc. directories). While there is little to no performance gain from moving this to a new location, we do recommend moving all repositories to a location outside of the NiFi install directories.
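For example, assuming a dedicated mount point such as /database-repo (an illustrative path, not a default), the relocated H2 database repository would be configured in nifi.properties as:

nifi.database.directory=/database-repo/database_repository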
It is unlikely that the user or history DBs will change between NiFi releases, so moving them outside of the base install path can simplify upgrading, allowing you to retain the user and component history information after upgrading.

# FlowFile Repository

This is by far the most important repository in NiFi. It maintains state on all FlowFiles currently located anywhere in the dataflows on the NiFi UI. If it should become corrupt, the potential exists that you could lose access to some or all of the files currently being worked on by NiFi. The most common cause of corruption is running out of disk space. The default configuration again has the repository located in <root-level-nifi-dir>/flowfile_repository:

nifi.flowfile.repository.directory=./flowfile_repository

For the same reasons mentioned for the database repository, we recommend moving this repository out of the base install path. You will also want to have the FlowFile repository located on a disk (preferably a high-performance RAID) that is not shared with other high-I/O software. On high-performance systems, the FlowFile repository should never be located on the same hard disk/RAID as either the content repository or the provenance repository if at all possible.

NiFi does not move the physical file (content) from processor to processor; FlowFiles serve as the unit of transfer from one processor to the next. In order to make that as fast as possible, FlowFiles live inside the JVM memory. This is great until you have so many FlowFiles in your system that you begin to run out of JVM memory and performance takes a serious dive. To reduce the likelihood of this happening, NiFi has a threshold that defines how many FlowFiles can live in memory on a single connection queue before being swapped out to disk. Connections are represented in the UI as arrows connecting two processors together; FlowFiles can queue on these connections while waiting for the follow-on processor.

nifi.queue.swap.threshold=20000

If the total number of FlowFiles in any one connection queue exceeds this value, swapping will occur. Depending on how much swapping is occurring, performance can be affected. If queues holding in excess of 20,000 FlowFiles are the norm rather than the occasional data surge for your dataflow, it may be wise to increase this value. You will need to closely monitor your heap memory usage and increase it as well to handle these larger queues. Heap size is set in the bootstrap.conf file, which will be discussed later.

# Content Repository

Since the content (the physical file that was ingested by NiFi) for every FlowFile that is consumed by NiFi is placed inside the content repository, the hard disk that this repository is located on will experience high I/O on systems that deal with high data volumes. As you can see, once again the repository is created by default inside the NiFi installation path:

nifi.content.repository.directory.default=./content_repository

The content repository should be moved to its own hard disk/RAID. Sometimes even having a single dedicated high-performance RAID is not enough. NiFi allows you to configure multiple content repositories within a single instance of NiFi and will then round-robin files across however many content repositories are defined. Setting up additional content repositories is easy: first remove or comment out the line above and then add a new line for each content repository you want to add.
For each new line you will need to change the word "default" to some other unique value:

nifi.content.repository.directory.contS1R1=/cont-repo1/content_repository
nifi.content.repository.directory.contS1R2=/cont-repo2/content_repository
nifi.content.repository.directory.contS1R3=/cont-repo3/content_repository

In the above example, you can see "default" was changed to contS1R1, contS1R2, and contS1R3. The 'S' stands for System and the 'R' stands for Repo. The user can use whatever names they prefer as long as they are unique to one another. Each hard disk/RAID would be mapped to /cont-repo1, /cont-repo2, or /cont-repo3 in the above example. This division of I/O load across multiple disks can result in significant performance gains and durability in the event of failures.

*** In a NiFi cluster, every node can be configured to use the same names for their various repositories, but it is recommended to use different names. It makes monitoring repository disk utilization in a cluster easier within the NiFi UI, since multiple repositories with the same name will be summed together under the system diagnostics window. So in a cluster you might use contS2R1, contS2R2, contS3R1, contS3R2, and so on.

# Provenance Repository (UPDATED)

Similar to the content repository, the provenance repository can use a large amount of disk I/O for writing and reading provenance events, because every transaction within the entire dataflow that affects either the FlowFile or its content has a provenance event created. The UI does not restrict the number of users who can make simultaneous queries against the provenance repository, and I/O increases with the number of concurrent queries taking place. The default configuration has the provenance repository being created inside the NiFi installation path just like the other repositories:

nifi.provenance.repository.directory.default=./provenance_repository

It is recommended that the provenance repository also be located on its own hard disk/RAID and not share its disk with any of the other repositories (database, FlowFile, or content). Just like the content repository, multiple provenance repositories can be defined by providing unique names in place of 'default' along with their paths:

nifi.provenance.repository.directory.provS1R1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.provS1R2=/prov-repo2/provenance_repository

In the above example, you can see "default" was changed to provS1R1 and provS1R2. The 'S' stands for System and the 'R' stands for Repo. The user can use whatever names they prefer as long as they are unique to one another. Each hard disk/RAID would be mapped to /prov-repo1 or /prov-repo2 in the above example. This division of I/O load across multiple disks can result in significant performance gains and limits the impact of a disk failure when dealing with large amounts of provenance data.

The ability to query provenance information on files has great value to a variety of users. The number of threads that are available to conduct provenance queries is defined by:

nifi.provenance.repository.query.threads=2

On systems where numerous users may be making simultaneous queries against the provenance repository, it may be necessary to increase the number of threads allocated to this process. The default value is 2.

The number of threads to use for indexing provenance events so that they are searchable can be adjusted by editing this line:

nifi.provenance.repository.index.threads=1

The default value is 1.
For flows that operate on a very high number of FlowFiles, the indexing of provenance events could become a bottleneck. If this is the case, a bulletin will appear indicating, "The rate of the dataflow is exceeding the provenance recording rate. Slowing down flow to accommodate." If this happens, increasing the value of this property may increase the rate at which the provenance repository is able to process these records, resulting in better overall throughput. Keep in mind that as you increase the number of threads allocated to one process, you reduce the number available to another, so you should leave this at 1 unless the above message is encountered.

When provenance queries are performed, the configured shard size has an impact on how much of the heap is used for that process:

nifi.provenance.repository.index.shard.size=500 MB

Larger values for the shard size will result in more Java heap usage when searching the provenance repository but should provide better performance. The default value is 500 MB. This does not mean that 500 MB of heap is used. Keep in mind that if you increase the size of the shard, you may also need to increase the size of your overall heap, which is configured in the bootstrap.conf file discussed later.

*** IMPORTANT: The configured shard size MUST be smaller than 50% of the total configured provenance storage size (nifi.provenance.repository.max.storage.size, default 1 GB). So only increase the shard size above 500 MB if you have also increased the max storage size to at least double the new shard value. Note that the default shard size is exactly 50% of the default storage size, which leaves room for a race condition that can lead to issues with provenance; either the shard size should be reduced or the max storage size increased from 1 GB to a higher value. Most users increase the max storage size.

NOTE: While NiFi provenance cannot be turned off, the implementation can be changed from "PersistentProvenanceRepository" to "VolatileProvenanceRepository". This moves all provenance storage into heap memory. There is one configuration setting for how much of the heap can be used:

nifi.provenance.repository.buffer.size=100000

This value is the number of provenance events to hold in memory. Of course, by switching to the volatile implementation you will have very limited provenance history, and all provenance is lost any time the NiFi JVM is restarted, but it removes some of the overhead associated with provenance if you have no need to retain this information.

(NEW) nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository

As of Apache NiFi 1.8.0, the WriteAheadProvenanceRepository is the default. The WriteAheadProvenance implementation was first introduced in Apache NiFi 1.2.0 and HDF 3.0.0. The original PersistentProvenance implementation proved to be a poor performer with large volumes, which would impact the throughput of every dataflow on the canvas.

*** Important: While there is a large gain in performance with this new implementation, we have observed issues when using the Java G1GC garbage collector with it. This is due to numerous Lucene-related bugs in Java 8 that were not addressed until Java 9. We therefore recommend commenting out the G1GC line in the NiFi bootstrap.conf file to avoid this.

------------------------------------

The following is just an example system configuration for a high-performance Linux system. In a NiFi cluster, every node would be configured the same way.
(The hardware specs are the same for other operating systems; however, disk partitioning will vary.)

CPU: 24 - 48 cores
Memory: 64 - 128 GB
Hard drive configuration: (1 hardware RAID 1 array) (2 or more hardware RAID 10 arrays)

RAID 1 array (this could also be a RAID 10) logical volumes:
- /
- /boot
- /home
- /var
- /var/log/nifi-logs <-- point all your NiFi logs (logback.xml) here
- /opt <-- install NiFi here under a sub-directory
- /database-repo <-- point the NiFi database repository here
- /flowfile-repo <-- point the NiFi FlowFile repository here

1st RAID 10 array logical volumes:
- /cont-repo1 <-- point the NiFi content repository here

2nd RAID 10 array logical volumes:
- /prov-repo1 <-- point the NiFi provenance repository here

3rd RAID 10 array logical volumes (recommended):
- /cont-repo2 <-- point a 2nd NiFi content repository here

4th+ RAID arrays logical volumes (optional):
*** Use additional RAID arrays to increase the number of content and/or provenance repositories available to your NiFi instance.

bootstrap.conf file:

The bootstrap.conf file in the conf directory allows users to configure settings for how NiFi should be started. This includes parameters such as the size of the Java heap, what Java command to run, and Java system properties. This file comes pre-configured with default values. We will focus on just the properties that should be changed when installing NiFi for the purpose of high-volume / high-performance dataflows. The bootstrap.conf file is broken into sections just like the nifi.properties file. Since NiFi is a Java application, it requires that Java be installed. NiFi requires Java 7 or later, and some of the recommended settings below only apply to specific Java versions. We will only highlight the sections that need changing.

# JVM memory settings

This section is used to control the amount of heap memory used by the JVM running NiFi. Xms defines the initial memory allocation, while Xmx defines the maximum memory allocation for the JVM. The default values are very small and not suitable for dataflows of any substantial size. We recommend increasing both the initial and maximum heap memory allocations to at least 4 GB or 8 GB for starters:

java.arg.2=-Xms8g
java.arg.3=-Xmx8g

If you encounter any "out of memory" errors in your NiFi app log, this is an indication that you have either a memory leak or simply insufficient memory allocated to support your dataflow. When using large heap allocations, garbage collection can be a performance killer (most noticeable when major garbage collection occurs against the old-generation objects), so when using larger heap sizes it is recommended that a more efficient garbage collector be used to help reduce the impact on performance should major garbage collection occur. Be mindful that setting an extremely large heap size has its pluses and minuses: it gives you more room to process large batches of FlowFiles concurrently, but if your flow has garbage collection issues it could take considerable time for a full garbage collection to complete. If these stop-the-world events (garbage collection) take longer than the heartbeat threshold configured in a NiFi cluster setup, you will end up with nodes becoming disconnected from your cluster. You can configure your NiFi to use the G1 garbage collector by uncommenting the line below.
(G1GC is the default in HDF 2.x versions.)

#java.arg.13=-XX:+UseG1GC

(NEW) We no longer recommend using G1GC as the garbage collector due to issues observed when using the recommended WriteAheadProvenance implementation introduced in Apache NiFi 1.2.0 (HDF 3.0.0). From these versions forward this line should be commented out.

# Java 8 tuning

(This applies to all versions of HDF 1.x, HDF 2.x, NiFi 0.x, and NiFi 1.x when running with Java 8. HDF 2.x and NiFi 1.x require a minimum of Java 8.) You can add the following lines to your NiFi bootstrap.conf file to improve performance if you are running with a version of Java 8.

Increase the code cache size by uncommenting this line:

java.arg.codecachesize=-XX:ReservedCodeCacheSize=256m

The code cache is memory separate from the heap that contains all the JVM bytecode for methods compiled down to native code. If the code cache fills, the compiler will be switched off and will not be switched back on again. This will impact the long-running performance of NiFi, and the only way to recover is to restart the JVM (restart NiFi). By removing the comment on this line, the code cache size is increased to 256m, which should be sufficient to prevent the cache from filling up. The default code cache size is defined by the Java version you are running, but can be as little as 32m.

An additional layer of protection comes from removing the comment on the following two lines:

java.arg.codecachemin=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.codecacheflush=-XX:+UseCodeCacheFlushing

These parameters establish a boundary for how much of the code cache can be used before flushing occurs, preventing the cache from filling up and stopping the compiler.

Conclusion:

At this point you should have an environment configured so that it will support building high-volume / high-performance dataflows using NiFi. While this guide has set the stage for a solid-performing system, performance is also significantly governed by the configuration of your dataflows themselves. It is very important to understand what the various configuration parameters on every processor mean and how to read the information the processors provide at run time (In, Read/Write, Out, and Tasks/Time). Through interpretation of this information, a user can adjust parameters so that a dataflow is optimized as well, but that is information for another guide.
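Pulling the bootstrap.conf recommendations above together, a sample configuration might look like the following. The values are illustrative starting points, not universal recommendations, and should be tuned to your own flow and hardware:

# JVM memory settings
java.arg.2=-Xms8g
java.arg.3=-Xmx8g

# Leave G1GC commented out when using the WriteAheadProvenanceRepository on Java 8
#java.arg.13=-XX:+UseG1GC

# Java 8 code cache tuning
java.arg.codecachesize=-XX:ReservedCodeCacheSize=256m
java.arg.codecachemin=-XX:CodeCacheMinimumFreeSpace=10m
java.arg.codecacheflush=-XX:+UseCodeCacheFlushing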
04-21-2017
07:23 PM
3 Kudos
Note: This article was written as of the HDF 2.1.2 release, which is based off Apache NiFi 1.1.0.

NiFi has several processors that can be used to retrieve FlowFiles from an SFTP server. It is important to understand the different capabilities each provides so you know when you should be using one versus another. Let's start with the oldest of the available processors:

GetSFTP

The GetSFTP processor is the original processor introduced for retrieving files from a remote SFTP server. Things to know about this processor:

1. Disadvantage: This processor does not retain any state. This means it does not keep track of which files it has previously retrieved. So if the property "Delete Original" is set to "false", this processor will continue to retrieve the same file over and over again.

2. This processor is not cluster friendly, meaning in a NiFi cluster it should be set to run on "primary node only" so that every node in the cluster is not competing to pull the same data.

3. In a NiFi cluster, the data retrieved by the GetSFTP processor should be redistributed across all nodes before further processing is done. This spreads the workload out to all nodes so the "primary node" is not doing all the work. A "Remote Process Group" can be used to redistribute the data from the GetSFTP processor to all nodes within the cluster.

*** As of the Apache NiFi 1.8 release, a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use an RPG. It can be done by simply configuring any connection to perform the redistribution. This blog explains the new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***

Disadvantage: All data content is pulled on to the primary node before being distributed across the cluster.

Disadvantage: Since the GetSFTP processor needs to delete the source file in order to prevent continuous consumption, the data is unavailable to other users/servers.

Note: This processor has been deprecated in favor of the newer ListSFTP and FetchSFTP processors. It still exists to maintain backwards compatibility for NiFi users.

-----------------------------------------------------

Now let's talk about the ListSFTP and FetchSFTP processors and which of the disadvantages above were solved by these processors. The ListSFTP processor is designed to connect to an SFTP server just like GetSFTP did; however, it does not actually retrieve the data. Instead, it creates a 0-byte FlowFile for every file it lists from the SFTP server. The FetchSFTP processor takes these 0-byte FlowFiles as input and actually retrieves the associated data, inserting it into the FlowFile content at that time. It may sound like GetSFTP was simply split into two processors, but there are key advantages to doing it this way.

1. The ListSFTP processor does maintain state across a NiFi cluster. So if you do not delete the source data, this processor will not pick up the same data multiple times like the GetSFTP processor will.

2. While the ListSFTP processor is still not cluster friendly, meaning it should be run on the primary node only, the FetchSFTP processor is cluster friendly. The ListSFTP processor should be used to create the 0-byte FlowFiles, and then a Remote Process Group is used to distribute these FlowFiles across your cluster. The FetchSFTP processor is then used on every node to retrieve the actual FlowFile content from the SFTP server.
*** As of the Apache NiFi 1.8 release, a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use an RPG. It can be done by simply configuring any connection to perform the redistribution. This blog explains the new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***

Advantage: The primary node is no longer using excess resources writing all content to its content repository before redistributing the FlowFiles to all nodes in the cluster.

Advantage: Cluster-wide state allows the primary node to switch within your NiFi cluster and the ListSFTP processor will still not list the same files twice.

Advantage: Being able to leave files on the SFTP server allows that data to be consumed by other end users/systems.

Disadvantage: Using an RPG to redistribute the ListSFTP-generated FlowFiles can be annoying, since the remote input port the RPG sends to must exist at the root canvas level. So if the flow is nested down in a sub-process group, you must build a flow that feeds the load-balanced FlowFiles back down into that sub-process group.

---------------------------------------

You will find within NiFi several other examples of where processors have been deprecated in favor of newer list/fetch based processors.

Thank you, Matt
02-23-2017
07:06 PM
5 Kudos
There is a two-part process before any access to the NiFi UI is possible:

1. Authentication: By default, NiFi will use a user/server's SSL certificate, when provided in the connection, to authenticate. When no user/server certificate is presented, NiFi will then look for a Kerberos TGT (if Spnego has been configured in NiFi). Finally, if neither of the above were present in the connection, NiFi will use the login identity provider (if configured). Login identity providers include either LDAP or Kerberos; with both of these options, NiFi will present users with a login screen.

2. Authorization: Authorization is the mechanism that controls what features and components authenticated users are granted access to. The default authorizer NiFi will use is the internal file-based authorizer. There is an option to configure NiFi to use Ranger as the authorizer instead.

The intent of this article is not to discuss how to set up NiFi to use any of the authentication or authorizer options. This article covers how to modify what identity is passed to the authorizer after any one of the authentication mechanisms succeeds. What is actually passed to the authorizer varies depending on which authentication method is in use:

SSL certificates: Default, always enabled, and always checked first. NiFi uses the full DN from the certificate.

Spnego (Kerberos): Always on when enabled and only used if an SSL certificate was not present in the connection. NiFi uses the full user principal.

ldap-provider (option in login-identity-providers): Always on once configured and only used if both an SSL certificate and a TGT (if Spnego was enabled) are not present in the connection. The default configuration of the ldap-provider will use the full DN returned by LDAP upon successful authentication (USE_DN identity strategy). It can be configured to pass the username used to log in instead (USE_USERNAME identity strategy).

kerberos-provider (option in login-identity-providers): Always on once configured and only used if both an SSL certificate and a TGT (if Spnego was enabled) are not present in the connection. The kerberos-provider will use the user's full principal upon successful authentication.

Whether you choose to use the built-in file-based authorizer or optionally configure your NiFi to use Ranger instead, users must be added and granted various access policies. Adding users using either a full DN or a user principal can be both annoying and prone to errors, since the authorizer is case sensitive and white spaces are valid characters. This is where NiFi's optional identity mapping configuration comes into play. Identity mapping takes place after successful authentication and before authorization occurs. It gives you the ability to take the returned value from any of the four authentication methods and pass it through one or more mappings to produce a simple resulting value, which is then passed to your authorizer. The identity mapping properties are configured in NiFi's nifi.properties file, and each mapping you define consists of two parts:

nifi.security.identity.mapping.pattern.<user defined>=
nifi.security.identity.mapping.value.<user defined>=

The mapping pattern takes a Java regular expression as input, with the expectation that one or more capture groups are defined in that expression. One or more of those capture groups are then used in the mapping value to create the desired final result that will be passed to your configured authorizer.

**** Important note: If you are implementing pattern mapping on an existing NiFi cluster that is already running securely, the newly added mappings will be run against the DNs from the certificates created for your nodes and the Initial Admin Identity value you originally configured. If any of your mappings match, a new value is going to be passed to your authorizer, which means you may lose access to your UI. Before adding any mapping, make sure you have added the new mapped values as users to your NiFi and authorized them so you do not lose access.

By default, NiFi includes two example identity mappings commented out in the nifi.properties file. You can add as many identity mapping pattern and value pairs as you like to accommodate all your various user/server authentication types. Each must have a unique identifier. In the default examples the unique identifiers are "dn" and "kerb". You could, for example, add "nifi.security.identity.mapping.pattern.dn2=" and "nifi.security.identity.mapping.value.dn2=".

If you are using Ambari to install and manage your NiFi cluster (HDF 2.x versions), you can find the two sample identity mapping properties under "Advanced nifi-properties". If you want to add additional mappings beyond those two via Ambari, they should be added via the "Custom nifi-properties" config section; simply click the "Add Property..." link to add your new mappings.

The result of any successful authentication is run through all configured identity mappings until a match is found. If no match is found, the full DN or user principal is passed to the authorizer.

Let's take a look at a few examples:

| User/server DN or Principal | Identity Mapping Pattern | Identity Mapping Value | Result passed to authorizer |
|---|---|---|---|
| CN=nifi-server-01.openstacklocal, OU=NIFI | ^CN=(.*?), OU=(.*?)$ | $1 | nifi-server-01.openstacklocal |
| CN=nifi-01, OU=SME, O=mycp, L=Fulton, ST=MD, C=US | ^CN=(.*?), OU=(.*?), O=(.*?), L=(.*?), ST=(.*?), C=(.*?)$ | $1@$2 | nifi-01@SME |
| nifi/instance@MY.COMPANY.COM | ^(.*?)/instance@(.*?)$ | $1@$2 | nifi@MY.COMPANY.COM |
| cn=nifi-user1,ou=SME,dc=mycp,dc=com | ^cn=(.*?),ou=(.*?),dc=(.*?),dc=(.*?)$ | $1 | nifi-user1 |
| JohnDoe@MY.COMPANY.COM | ^(.*?)@(.*?)$ | $1 | JohnDoe |
| EMAILADDRESS=none@none.com, CN=nifi-user2, OU=SME, O=mycp, L=Fulton, ST=MD, C=US | ^EMAILADDRESS=(.*?), CN=(.*?), OU=(.*?), O=(.*?), L=(.*?), ST=(.*?), C=(.*?)$ | $2 | nifi-user2 |

As you can see from the above examples, using NiFi's pattern mapping ability will simplify authorizing new users via either NiFi's default file-based authorizer or Ranger.
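As a concrete example, the Kerberos principal row above corresponds to a mapping pair like the following (the "kerb" identifier is just an example name):

nifi.security.identity.mapping.pattern.kerb=^(.*?)/instance@(.*?)$
nifi.security.identity.mapping.value.kerb=$1@$2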
02-23-2017
02:25 PM
9 Kudos
NiFi works with FlowFiles. Every FlowFile consists of two parts: FlowFile content and FlowFile attributes. While the FlowFile's content lives on disk in the content repository, NiFi holds the "majority" of the FlowFile attribute data in the configured JVM heap memory space. I say "majority" because NiFi swaps attributes to disk on any queue that contains over 20,000 FlowFiles (the default, which can be changed in nifi.properties). Once your NiFi is reporting OutOfMemory (OOM) errors, there is no corrective action other than restarting NiFi, and if changes are not made to your NiFi or your dataflow, you are surely going to encounter this issue again and again. The default configuration for the JVM heap in NiFi is only 512 MB. This value is set in the bootstrap.conf file:

# JVM memory settings
java.arg.2=-Xms512m
java.arg.3=-Xmx512m

While the defaults may work for some dataflows, they are going to be undersized for others.
Simply increasing these values until you stop seeing OOM errors should not be your immediate go-to solution. Very large heap sizes can also have adverse impacts on your dataflow: garbage collection takes much longer to run against a very large heap, and while garbage collection occurs it is essentially a stop-the-world event. This amounts to dataflow stoppage for the length of time it takes to complete. I am not saying that you should never set large heap sizes, because sometimes that is really necessary; however, you should evaluate all other options first.

NiFi and FlowFile attribute swapping:

NiFi already has a built-in mechanism to help reduce the overall heap footprint. The mechanism swaps FlowFile attributes to disk when a given connection's queue exceeds the configured threshold. These settings are found in the nifi.properties file:

nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4

Swapping, however, will not help if your dataflow is so large that queues are backed up everywhere but still have not exceeded the threshold for swapping. Any time you decrease the swap threshold, more swapping can occur, which may cost some throughput performance. So here are some other things to check for. Some common reasons for running out of heap memory include:

1. A high-volume dataflow with lots of FlowFiles active at any given time across your dataflow. (Increase the configured NiFi heap size in bootstrap.conf to resolve.)
2. Creating a large number of attributes on every FlowFile. More attributes equals more heap usage per FlowFile. Avoid creating unused/unnecessary attributes on FlowFiles. (Increase the configured NiFi heap size in bootstrap.conf to resolve and/or reduce the configured swap threshold.)

3. Writing large values to FlowFile attributes. Extracting large amounts of content and writing it to an attribute on a FlowFile will result in high heap usage. Try to avoid creating large attributes when possible. (Increase the configured NiFi heap size in bootstrap.conf to resolve and/or reduce the configured swap threshold.)

4. Using the MergeContent processor to merge a very large number of FlowFiles. NiFi cannot merge FlowFiles that are swapped out, so all of these FlowFiles' attributes must be in heap when the merge occurs. If merging a very large number of FlowFiles is needed, try using two MergeContent processors in series with one another: have the first merge a maximum of 20,000 FlowFiles and the second merge those bundles into even larger bundles. (Increasing the configured NiFi heap size in bootstrap.conf will also help.)

5. Using the SplitText processor to split one FlowFile into a very large number of FlowFiles. Swapping on a connection queue will not occur until after the queue has exceeded the swapping threshold, and the SplitText processor will create all of the split FlowFiles before committing them to its output relationship. This is most commonly seen when SplitText is used to split a large incoming FlowFile by every line; it is possible to run out of heap memory before all the splits can be created. Try using two SplitText processors in series: have the first split the incoming FlowFiles into large chunks and the second split them down even further. (Increasing the configured NiFi heap size in bootstrap.conf will also help.)

Note: There are additional processors that can be used for splitting and joining large numbers of FlowFiles, so the same approach as above should be followed for those as well. I only specifically commented on the above since they are more commonly seen being used to deal with very large numbers of FlowFiles.
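If you do decide to lower the swap threshold (items 2 and 3) and/or raise the heap (items 1, 4, and 5), the relevant settings would look something like the following; the values are illustrative only, not recommendations for every flow:

In nifi.properties:
nifi.queue.swap.threshold=10000

In bootstrap.conf:
java.arg.2=-Xms4g
java.arg.3=-Xmx4g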
02-08-2017
09:55 PM
19 Kudos
What is Content Repository Archiving?

There are three properties in the nifi.properties file that deal with the archiving of content in the NiFi content repository. The default NiFi values for these are shown below:

nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true

The purpose of content archiving is so that users can view and/or replay content via the provenance UI that is no longer in their dataflow(s). The configured values do not have any impact on the amount of provenance history that is retained. If the content associated with a particular provenance event no longer exists in the content archive, provenance will simply report to the user that the content is not available.

The content archive is kept within the same directory or directories where you have configured your content repository(s) to exist. When a "content claim" is archived, that claim is moved into an archive subdirectory within the same disk partition where it originally existed. This keeps archiving from affecting NiFi's content repository performance with the unnecessary writes that would be associated with moving archived files to a new disk/partition, for example.

The configured max retention period tells NiFi how long to keep an archived content claim before purging it from the content archive directory. The configured max usage percentage tells NiFi at what point it should start purging archived content claims to keep the overall disk usage at or below the configured percentage. This is a soft limit. Let's say the content repository is at 49% usage and a 4 GB content claim then becomes eligible for archiving. Once that content claim is archived, the usage may exceed the configured 50% threshold. At the next checkpoint, NiFi will remove the oldest archived content claim(s) to bring the overall disk usage back at or below 50%. For this reason, this value should never be set to 100%. The above two properties are enforced using an "or" policy: whichever limit is reached first will trigger the purging of archived content claims.

Let's look at a couple of examples:

Example 1: Here the content repository has 35% of its disk consumed by content claims that are tied to FlowFiles still active somewhere in one or more dataflows on the NiFi canvas. This leaves 15% of the disk space to be used for archived content claims.

Example 2: Here the amount of content claims still active somewhere within your NiFi flow has exceeded 50% disk usage in the content repository. As a result, there are no archived content claims.

The content repository archive settings have no bearing on how much of the content repository disk will be used by active FlowFiles in your dataflow(s). As such, it is possible for your content repository to still fill to 100% disk usage. *** This is the exact reason why, as a best practice, you should avoid co-locating your content repository with any of the other NiFi repositories. It should be isolated to disk(s) that will not affect other applications or the OS should it fill to 100%.

What is a Content Claim?

I have mentioned "content claim" throughout this article, and understanding what a content claim is will help you understand your disk usage. NiFi stores content in the content repository inside claims. A single claim can contain the content from one to many FlowFiles. The property that governs how a content claim is built is found in the nifi.properties file. The default configuration value is shown below:

nifi.content.claim.max.appendable.size=50 KB

The purpose of content claims is to make the most efficient use of disk storage. This is especially true when dealing with many very small files.
The configured max appendable size tells NiFi at what point it should stop appending additional content to an existing content claim and start a new claim. It does not mean all content ingested by NiFi must be smaller than the configured size, and it also does not mean that every content claim will be at least that size.

Example 1: Here we have a single content claim that contains both large and small pieces of content. The overall size has exceeded the configured max appendable size (10 MB in this example) because, at the time NiFi started streaming that final piece of content into the claim, the claim's size was still below 10 MB.

Example 2: Here we have a content claim that contains only one piece of content. This is because once that content was written to the claim, the claim exceeded the configured max appendable size. If your dataflow(s) deal with nothing but files over the configured size, all of your content claims will contain only one piece of content.

So when is a "Content Claim" moved to archive?

A content claim cannot be moved into the content repository archive until none of the pieces of content in that claim are tied to a FlowFile that is active anywhere within any dataflow on the NiFi canvas. What this means is that the reported cumulative size of all the FlowFiles in your dataflows will likely never match the actual disk usage in your content repository. The cumulative size reported in the UI is not the size of the content claims in which the queued FlowFiles reside, but rather just the cumulative size of the individual pieces of content. It is for this reason that it is possible for a NiFi content repository to hit 100% disk usage even if the NiFi UI reports a total cumulative queued data size of less than that. Take Example 1 from above: assuming the last piece of content written to that claim was 100 GB in size, all it would take is for one of the very small pieces of content in that same claim to still be queued somewhere in a dataflow to prevent the claim from being archived. As long as any FlowFile still points at a content claim, that entire content claim cannot be purged.

When fine-tuning your NiFi default configurations, you must always take your intended data into consideration. If you are working with nothing but very small OR very large data, leave the default values alone. If you are working with data that ranges greatly from very small to very large, you may want to decrease the max appendable size and/or max flow files settings. By doing so you decrease the number of FlowFiles that make it into a single claim, which in turn reduces the likelihood of a single piece of data keeping large amounts of data active in your content repository.
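As an illustration of that last point (the values below are examples only, not recommendations), the claim sizing settings in nifi.properties might be adjusted like this:

nifi.content.claim.max.appendable.size=1 MB
nifi.content.claim.max.flow.files=50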
02-02-2017
07:25 PM
9 Kudos
How to access your secured NiFi instance or cluster:

So you have secured your NiFi instance (meaning you have configured it for https access) and now you are trying to access the https web UI. Once NiFi is secured, any entity interacting with the UI will need to successfully authenticate and then be authorized to access the particular NiFi resource(s). As of HDF 2.1.1 / Apache NiFi 1.1.0, NiFi supports authentication via user certificates (default - always enabled), Kerberos/Spnego, or username and password based authentication via LDAP, LDAPS, or Kerberos. The intent of this article is not to cover the authentication process, but rather to cover the initial admin authorization process. We assume for this article that authentication is successful. How do you know? A quick look in the nifi-user.log will tell you whether your user's authentication was successful.

Following successful authentication comes NiFi authorization. NiFi authorization can be handled by NiFi's default built-in file-based authorizer or handled externally via Ranger. This article will cover the default built-in file-based authorizer.

NiFi's built-in file-based authorization:

There are four files in NiFi that contain properties used by NiFi's file-based authorizer:
nifi.properties, authorizers.xml, users.xml, and authorizations.xml. We will start by showing what role each of these files plays in NiFi user/server authorization.

nifi.properties file (Pattern Mapping):

The nifi.properties file contains a lot of key/value pairs that are used by NiFi's core. This file happens to be where users can define identity mapping patterns. These properties allow normalizing user identities, such that identities coming from different identity providers (certificates, LDAP, Kerberos) can be treated the same internally in NiFi. It is the resulting value from a matching pattern that is passed to the configured authorizer (NiFi's file-based authorizer or Ranger). NiFi includes two examples that are commented out in the nifi.properties file; however, you can add as many unique identity mapping patterns as you need.

nifi.security.identity.mapping.pattern.dn=^CN=(.*?),OU=(.*?),O=(.*?),L=(.*?),ST=(.*?),C=(.*?)$
nifi.security.identity.mapping.value.dn=$1@$2
nifi.security.identity.mapping.pattern.kerb=^(.*?)/instance@(.*?)$
nifi.security.identity.mapping.value.kerb=$1@$2

All mapping patterns use Java regular expressions. They are case sensitive, and white space matters between elements. For example:

^CN=(.*?),OU=(.*?),O=(.*?),L=(.*?),ST=(.*?),C=(.*?)$

would match on: CN=John Doe,OU=SME,O=mycp,L=Bmore,ST=MD,C=US

but would not match on: cn=John Doe, ou=SME, o=mycp, l=Bmore, st=MD, c=US (note the lowercase and white spaces)

Assuming a DN of CN=John Doe,OU=SME,O=mycp,L=Bmore,ST=MD,C=US, the associated mapping value would return John Doe@SME.

Additional mapping patterns can be added simply by adding additional properties to the nifi.properties file similar to the above examples, except each must have a unique value following nifi.security.identity.mapping.pattern. or nifi.security.identity.mapping.value. For example:

nifi.security.identity.mapping.pattern.dn2=^CN=(.*?), OU=(.*?)$
nifi.security.identity.mapping.value.dn2=$1

While you can create as many mapping patterns as you like, it is important to make sure that you do not have more than one pattern that can match your incoming user/server identity. User identities are run against every configured pattern, and only the last pattern that matches will be applied.

authorizers.xml (default configuration supports the file-provider):

This file is where you set up NiFi's file-based authorizer. It is in this file that you will find the "Initial Admin Identity" property. It is very important that you correctly define an "Initial Admin Identity" before starting your secured https NiFi for the first time (no worries if you have not; I will discuss how to fix issues when you did not or had a typo). If you are securing a NiFi cluster, you will also need to configure a "Node Identity x" for each node in your cluster (where "x" is a sequential number). *** Don't forget to remove the comment markers "<!--" and "-->" from around these properties.

So, what values should I be providing to these properties? That depends on a few factors:

Which authentication method did I use?
User/server/node certificates (default) - A user certificate contains a DN for that user. This full DN is evaluated by any configured identity mapping patterns and the result is passed to the authorizer. NiFi nodes can only use server certificates to authenticate; each server is issued a server certificate, and the full DNs from those certificates are evaluated by any configured identity mapping patterns with the result passed to the authorizer.

Kerberos/Spnego - The user's principal is evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

LDAP/LDAPS - Users are presented with a login screen. NiFi's LDAP configuration can be set up to pass either the DN returned by LDAP for the user (default) or the username supplied at the login screen. This return value is evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

Kerberos - Users are presented with a login screen. The user's principal is evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

Did I set up identity pattern mappings?
If no identity mapping patterns were defined, the full return from the configured authentication method is passed to the authorizer. If the user/server identity fails to match any of the defined identity mapping patterns, the full return from the configured authentication method is passed to the authorizer.

Whatever the final resulting value is, that is what needs to be entered in the "Initial Admin Identity" and "Node Identity x" properties. Let's assume the following user/server DNs and that multiple identity mappings were set up in the nifi.properties file:

| Sample entity DN | Configured Identity Mapping Pattern | Configured Identity Mapping Value | Resulting value |
|---|---|---|---|
| cn=JohnDoe,ou=SME,dc=work | ^cn=(.*?),ou=(.*?),dc=(.*?)$ | $1 | JohnDoe |
| CN=nifi-server1, OU=NIFI | ^CN=(.*?), OU=(.*?)$ | $1 | nifi-server1 |
| CN=nifi-server2, OU=NIFI | ^CN=(.*?), OU=(.*?)$ | $1 | nifi-server2 |

Your authorizers.xml file would then be configured with these resulting values, and the values configured there will be used to seed the users.xml and authorizations.xml files.

users.xml

The users.xml file is produced the first time, and only the first time, NiFi is started securely (https). This file will contain your "Initial Admin Identity" and all of your "Node Identity x" configured values:
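As a rough sketch only (the identifiers in users.xml are UUIDs generated by NiFi and are shown here as placeholders), the file-provider section of authorizers.xml and the users.xml seeded from it for the example identities above would look something like:

<authorizer>
    <identifier>file-provider</identifier>
    <class>org.apache.nifi.authorization.FileAuthorizer</class>
    <property name="Authorizations File">./conf/authorizations.xml</property>
    <property name="Users File">./conf/users.xml</property>
    <property name="Initial Admin Identity">JohnDoe</property>
    <property name="Legacy Authorized Users File"></property>
    <property name="Node Identity 1">nifi-server1</property>
    <property name="Node Identity 2">nifi-server2</property>
</authorizer>

<tenants>
    <groups/>
    <users>
        <user identifier="generated-uuid-1" identity="JohnDoe"/>
        <user identifier="generated-uuid-2" identity="nifi-server1"/>
        <user identifier="generated-uuid-3" identity="nifi-server2"/>
    </users>
</tenants>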
authorizations.xml

The authorizations.xml file is produced the first time, and only the first time, NiFi is started securely (https). NiFi will assign the access policies needed by your "Initial Admin Identity" and "Node Identity x" users/servers. Your "Initial Admin Identity" user is granted the following resources/access policies:

| Resource | NiFi UI Access Policy | Details |
|---|---|---|
| /flow (R) | view the UI | All users, including the admin, must have this access policy in order to access and view the NiFi UI. |
| /restricted-components (W) | access restricted components | Allows granted users the ability to add/configure NiFi components tagged as restricted on the canvas. |
| /tenants (R and W) | access users/user groups (view and modify) | Allows granted users the ability to add/remove/modify users and user groups in NiFi for authorization. |
| /policies (R and W) | access all policies (view and modify) | Allows granted users the ability to add/remove various access policies for any users and user groups. |
| /controller (R and W) | access the controller (view and modify) | Allows granted users the ability to view/modify the controller, including Reporting Tasks, Controller Services, and Nodes in the Cluster. |

You may notice a few additional access policies were granted to your admin user. This will only happen if the NiFi you secured already had an existing flow.xml.gz file. In that case the "Initial Admin Identity" is also granted access to view and modify the dataflow at the NiFi root canvas level. By default, all sub NiFi process groups inherit their access policies from the parent process group, which effectively gives the admin user full access to the dataflow.

The "Node Identity x" servers are granted the following access policies:

| Resource | NiFi UI Access Policy | Details |
|---|---|---|
| /proxy (R and W) | proxy user requests (view and modify) | Allows proxy machines to send requests on behalf of others. All nodes in a NiFi cluster must be granted this access policy so users can make changes to the cluster while logged in to any of the NiFi cluster's nodes. |

What do I do if I messed up my "Initial Admin Identity" or "Node Identity x" values when setting up my authorizers.xml file?

It is common for users to incorrectly configure the value for either the "Initial Admin Identity" or the "Node Identity x" properties. Common mistakes include bad mapping patterns, case sensitivity issues (LDAP DNs always have the cn, ou, etc. values in lowercase), and white space issues between DN sections (cn=JohnDoe, ou=sme versus cn=JohnDoe,ou=sme). You can use the nifi-user.log to identify the actual value being passed to the authorizer and then follow these steps:

1. Correct your authorizers.xml configuration.
2. Delete or rename the current users.xml and authorizations.xml files on all of your NiFi nodes.
3. Restart all of your NiFi nodes.

NiFi will generate new users.xml and authorizations.xml files from the corrected authorizers.xml file. You should only follow this procedure to correct issues when first setting up a secured NiFi. If an admin was previously able to access your NiFi's canvas and add new users and grant access policies to those users, all of those users and access policies will be lost if you delete the users.xml and authorizations.xml files.

Thanks, Matt
01-27-2017
08:53 PM
7 Kudos
With HDF 2.x, Ambari can be used to deploy a NiFi cluster. Let's say you deployed a 2-node cluster and want to go back at a later time and add an additional NiFi node to the cluster. While the process is very straightforward when your NiFi cluster has been set up non-secure (http), the same is not true if your existing NiFi cluster has been secured (https). The steps below assume an existing 2-node secured NiFi cluster that was installed via Ambari.

STEP 1: Add the new host through Ambari. You can skip this step if the host on which you want to install the additional NiFi node is already managed by your Ambari.

STEP 2: Under "Hosts" in Ambari, click on the host from the list where you want to install the new NiFi node and add the NiFi component to it. The NiFi component will be in a "stopped" state after it is installed on this new host. *** DO NOT START NIFI YET ON THE NEW HOST OR IT WILL FAIL TO JOIN THE CLUSTER. ***

STEP 3: (This step only applies if NiFi's file-based authorizer is being used.) Before starting this new node we need to clear out some NiFi configs. This step is necessary because of how the NiFi application starts: when NiFi starts, it looks for the existence of the users.xml and authorizations.xml files. If they do not exist, it uses the configured "Initial Admin Identity" and "Node Identity (1, 2, 3, etc.)" values to build the users.xml and authorizations.xml files. This causes a problem because your existing cluster's users.xml and authorizations.xml files likely contain many more entries by now, and any mismatches in these files will prevent a node from being able to join the cluster. If these configurations are not present, the new node will instead inherit the files from the cluster it joins. *Note: Another option is to simply copy the users.xml and authorizations.xml files from an existing cluster node to the new node before starting the new node.

STEP 4: (Do this step if using Ambari metrics.) When a new node is added by Ambari and Ambari metrics are also enabled, Ambari will create a flow.xml.gz file that contains just the Ambari reporting task. Later, when this node tries to join the cluster, the flow.xml.gz files on this new node and on the cluster will not match. This mismatch will cause the new node to fail to join the cluster and shut back down. To avoid this problem, the flow.xml.gz file must be copied from one of the cluster's existing nodes to this new node.

STEP 5: Start NiFi on the new node. After the node has started, it should successfully join your existing cluster. If it fails, the nifi-app.log will explain why, but it will likely be related to one of the above configs not being cleared out, causing the users.xml and authorizations.xml files to be generated rather than inherited from the cluster. If that is the case, you will need to fix the configs and delete those files manually before restarting the node again.

STEP 6: Your cluster is now up and running with the additional node, but you will notice you cannot open the UI of the new node without getting an untrusted proxy error screen. You will, however, still be able to access your other two nodes' UIs. So we need to authorize this new node in your cluster.

A. If NiFi handles your authorizations, follow this procedure:

1. Log in to the UI of one of the original cluster nodes and grant the new node's identity the "proxy user requests" access policy, which is needed to allow users to access the UI through your nodes.
NOTE: There may be additional component level access policies (such as "view the data" and "modify the data") you may also want to authorize this new node for. B. If Ranger handles your NiFI authorizations, follow this procedure: 1. Access the Ranger UI: 2. Click Save to create this new user for your new node. Username MUST match exactly with the DN displayed in the untrusted proxy error screen. 3. Access the NiFi service Manager in Ranger and authorize your new node to your existing access policies as needed: You should now have a full functional new node added to your pre-existing secured NiFi cluster that was deployed/installed via Ambari.
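As a rough sketch of the file copies mentioned in steps 3 and 4 (the hostname and paths are placeholders; adjust them to wherever your NiFi conf directory lives):

# run from an existing, healthy cluster node; new-node.example.com is a hypothetical hostname
scp /path/to/nifi/conf/users.xml /path/to/nifi/conf/authorizations.xml new-node.example.com:/path/to/nifi/conf/
scp /path/to/nifi/conf/flow.xml.gz new-node.example.com:/path/to/nifi/conf/
# then start NiFi on the new node (via Ambari) so it inherits these files instead of generating its own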
... View more
Labels:
11-02-2016
06:26 PM
1 Kudo
@apsaltis I might suggest we make a few changes to this article:
1. The link you have for installing HDF talks about installing HDF 2.0. HDF 2.0 is based off Apache NiFi 1.0. Since MiNiFi is built from Apache NiFi 0.6.1, the dataflows built and templated for conversion into MiNiFi YAML files must also be built using an Apache 0.6 based NiFi install. (I see in your example above you did just that, but this needs to be made clear.)
2. I would never recommend setting nifi.remote.input.socket.host= to "localhost". When a NiFi or MiNiFi connects to another NiFi via S2S, the destination NiFi will return the value set for this property along with the value set for nifi.remote.input.socket.port=. In your example that means the source MiNiFi would then try to send FlowFiles to localhost:10000. This is ONLY going to work if the destination NiFi is located on the same server as the MiNiFi. (See the illustrative nifi.properties snippet at the end of this comment.)
3. You should also explain why you are changing nifi.remote.input.secure= from true to false. Changing this is not a requirement of MiNiFi, it is simply a matter of preference (if set to true, both MiNiFi (source) and NiFi (destination) must be set up to run securely over https). In your example you are working with http only.
4. While doable, one should never route the "success" relationship from any processor back on to itself. If you have reached the end of your dataflow, you should auto-terminate the "success" relationship.
5. I am not clear what you are telling me to do based on this line under step 5: "Start the From MiNiFi Input Port"
6. When using the GenerateFlowFile processor in an example flow, it is important to recommend that users set a run schedule other than "0 sec". Since MiNiFi is Apache 0.6.1 based, there is no default backpressure on connections, and with a run schedule of "0 sec" it is very likely this processor will produce FlowFiles much faster than they can be sent across S2S. This will eventually fill the hard drive of the system running MiNiFi. An even better recommendation would be to make sure they set backpressure between the GenerateFlowFile processor and the Remote Process Group (RPG). That way, even if someone stops the NiFi and not the MiNiFi, they don't fill their MiNiFi hard drive. Thanks, Matt
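For illustration only (the hostname is a placeholder; the port matches the example above), the destination NiFi's nifi.properties for this non-secure S2S example might look something like this:

# hypothetical hostname; must be resolvable and reachable from the MiNiFi host (not localhost)
nifi.remote.input.socket.host=nifi-dest.example.com
nifi.remote.input.socket.port=10000
# false because this example is http only; set to true only if both MiNiFi and NiFi are secured
nifi.remote.input.secure=false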
... View more
04-26-2016
09:21 PM
There are additional items that will need to be taken into consideration if you are running a NiFi cluster. See the following for more details:
https://community.hortonworks.com/content/kbentry/28180/how-to-configure-hdf-12-to-send-to-and-get-data-fr.html
... View more
04-18-2016
09:28 PM
4 Kudos
Setting up Hortonworks Dataflow (HDF) to work with kerberized Kafka in Hortonworks Data Platform (HDP)

HDF 1.2 does not contain the same Kafka client libraries as the Apache NiFi version. The HDF Kafka libraries are specifically designed to work with the Kafka versions supplied with HDP. The following Kafka support matrix breaks down what is supported in each Kafka version:
*** (Apache) refers to the Kafka version downloadable from the Apache website.

For newer versions of HDF (1.1.2+), NiFi uses zookeeper to maintain cluster wide state, so the following only applies if this is an HDF NiFi cluster:
1. If a NiFi cluster has been set up to use a kerberized external or internal zookeeper for state, every kerberized connection to any other zookeeper must use the same keytab and principal. For example, a kerberized embedded zookeeper in NiFi would need to be configured to use the same client keytab and principal you want to use to authenticate with, say, a Kafka zookeeper.
2. If a NiFi cluster has been set up to use a non-kerberized zookeeper for state, it cannot then talk to any other zookeeper that does use kerberos.
3. If a NiFi cluster has been set up to use a kerberized zookeeper for state, it cannot then communicate with any other non-kerberized zookeeper.

With that being said, the PutKafka and GetKafka processors do not have properties like the HDFS processors have for keytab and principal. The keytab and principal are defined in the same jaas file used when you set up HDF cluster state management. So before even trying to connect to kerberized Kafka, we need to get NiFi state management configured to use either an embedded or external kerberized zookeeper for state. Even if you are not clustered right now, you need to take the above into consideration if you plan on upgrading to being a cluster later.

——————————————
NiFi Cluster Kerberized State Management: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#state_management

Let's assume you followed the above linked procedure to set up your NiFi cluster to create an embedded zookeeper. At the end of that procedure you will have made the following config changes on each of your NiFi nodes:

1. Created a zookeeper-jaas.conf file. On nodes with an embedded zookeeper, it will contain something like this:

Server {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/zookeeper-server.keytab"
  storeKey=true
  useTicketCache=false
  principal="zookeeper/myHost.example.com@EXAMPLE.COM";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  storeKey=true
  useTicketCache=false
  principal="nifi@EXAMPLE.COM";
};

On nodes without an embedded zookeeper, it will look something like this:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  storeKey=true
  useTicketCache=false
  principal="nifi@EXAMPLE.COM";
};

2. Added a config line to the NiFi bootstrap.conf file:

java.arg.15=-Djava.security.auth.login.config=/<path>/zookeeper-jaas.conf

*** The arg number (15 in this case) must be unused by any other java.arg line in the bootstrap.conf file.

3. Added 3 additional properties to the bottom of the zookeeper.properties file you have configured per the linked procedure above:

authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
jaasLoginRenew=3600000
requireClientAuthScheme=sasl
—————————————
Scenario 1: Kerberized Kafka setup for a NiFi cluster

For scenario one, we will assume you are running a NiFi cluster that has been set up per the above to use a kerberized zookeeper for NiFi state management. Now that you have that setup, you have the foundation in place to add support for connecting to kerberized Kafka brokers and Kafka zookeepers. The PutKafka processor connects to the Kafka brokers and the GetKafka processor connects to the Kafka zookeepers. In order to connect via Kerberos, we will need to do the following:

1. Modify the zookeeper-jaas.conf file created when you set up kerberized state management above. You will need to add a new KafkaClient section to the zookeeper-jaas.conf file. If your NiFi node is running an embedded zookeeper node, your zookeeper-jaas.conf file will contain:

Server {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/zookeeper-server.keytab"
  storeKey=true
  useTicketCache=false
  principal="zookeeper/myHost.example.com@EXAMPLE.COM";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  storeKey=true
  useTicketCache=false
  principal="nifi@EXAMPLE.COM";
};
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  renewTicket=true
  serviceName="kafka"
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  principal="nifi@EXAMPLE.COM";
};
*** What is important to note here is that both the "KafkaClient" and "Client" sections (used for both the embedded zookeeper and the Kafka zookeeper) use the same principal and keytab. ***
*** The principal and keytab for the "Server" section (used by the embedded NiFi zookeeper) do not need to be the same as those used by the "KafkaClient" and "Client" sections. ***

If your NiFi cluster node is not running an embedded zookeeper node, your zookeeper-jaas.conf file will contain:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  storeKey=true
  useTicketCache=false
  principal="nifi@EXAMPLE.COM";
};
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  renewTicket=true
  serviceName="kafka"
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  principal="nifi@EXAMPLE.COM";
};

*** What is important to note here is that both the KafkaClient and the Client sections (used for both the embedded zookeeper and the Kafka zookeeper) use the same principal and keytab. ***

2. Add an additional property to the PutKafka and GetKafka processors:

Now all the pieces are in place and we can start our NiFi(s) and add/modify the PutKafka and GetKafka processors. You will need to add one new property on each PutKafka and GetKafka processor's Properties tab: security.protocol set to PLAINTEXTSASL. You will use this same security.protocol (PLAINTEXTSASL) when interacting with HDP Kafka versions 0.8.2 and 0.9.0.

————————————
Scenario 2: Kerberized Kafka setup for a standalone NiFi instance

For scenario two, a standalone NiFi does not use zookeeper for state management, so rather than modifying an existing jaas.conf file, we will need to create one from scratch. The PutKafka processor connects to the Kafka brokers and the GetKafka processor connects to the Kafka zookeepers. In order to connect via Kerberos, we will need to do the following:

1. Create a jaas.conf file somewhere on the server running your NiFi instance. This file can be named whatever you want, but to avoid confusion later, should you turn your standalone NiFi deployment into a NiFi cluster deployment, I recommend continuing to name the file zookeeper-jaas.conf. You will need to add the following lines to this zookeeper-jaas.conf file, which will be used to communicate with the kerberized Kafka brokers and kerberized Kafka zookeeper(s):

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  storeKey=true
  useTicketCache=false
  principal="nifi@EXAMPLE.COM";
};
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  renewTicket=true
  serviceName="kafka"
  useKeyTab=true
  keyTab="./conf/nifi.keytab"
  principal="nifi@EXAMPLE.COM";
};
*** What is important to note here is that both the KafkaClient and Client sections use the same principal and keytab. ***

2. Add a config line to the NiFi bootstrap.conf file:

java.arg.15=-Djava.security.auth.login.config=/<path>/zookeeper-jaas.conf

*** The arg number (15 in this case) must be unused by any other java.arg line in the bootstrap.conf file.

3. Add an additional property to the PutKafka and GetKafka processors:

Now all the pieces are in place and we can start our NiFi(s) and add/modify the PutKafka and GetKafka processors. You will need to add one new property on each PutKafka and GetKafka processor's Properties tab: security.protocol set to PLAINTEXTSASL. You will use this same security.protocol (PLAINTEXTSASL) when interacting with HDP Kafka versions 0.8.2 and 0.9.0.

————————————————————
That should be all you need to get set up and going. Let me fill you in on a few configuration recommendations for your PutKafka and GetKafka processors to achieve better throughput:
PutKafka:
1. Ignore for now what the documentation says for the Batch Size property on the PutKafka processor. It is really a measure of bytes, so jack that baby up from the default 200 to some much larger value.
2. Kafka can be configured to accept larger files, but it is much more efficient working with smaller files. The default max message size accepted by Kafka is 1 MB, so try to keep individual messages smaller than that. Set the Max Record Size property to the max size a message can be, as configured on your Kafka. Changing this value will not change what your Kafka can accept, but it will prevent NiFi from trying to send something too big.
3. The Max Buffer Size property should be set to a value large enough to accommodate the FlowFiles it is being fed. A single NiFi FlowFile can contain many individual messages, and the Message Delimiter property can be used to split that large FlowFile content into its smaller messages. The delimiter could be a new line or even a specific string of characters to denote where one message ends and another begins.
4. Leave the run schedule at 0 sec, and you may even want to give PutKafka an extra thread (Concurrent Tasks). An illustrative set of values is sketched below.
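Purely as an illustration of the recommendations above (these values are not from the article and will need tuning for your broker configuration and message sizes), a PutKafka configuration might end up looking something like this:

Batch Size: 100000          (treated as bytes; raised well above the default of 200)
Max Record Size: 1 MB       (match the max message size configured on your Kafka)
Max Buffer Size: 5 MB       (large enough to hold the largest incoming FlowFile)
Message Delimiter: \n       (newline, when one FlowFile carries many newline-separated messages)
Run Schedule: 0 sec
Concurrent Tasks: 2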
GetKafka:
1. The Batch Size property on the GetKafka processor is correct in the documentation and does refer to the number of messages to batch together when pulled from a Kafka topic. The messages will end up in a single outputted FlowFile, and the configured Message Demarcator (default: new line) will be used to separate the messages.
2. When pulling data from a Kafka topic that has been configured to allow messages larger than 1 MB, you must add an additional property to the GetKafka processor so it will pull those larger messages (the processor itself defaults to 1 MB). Add fetch.message.max.bytes and configure it to match the max allowed message size set on Kafka for the topic.
3. When using the GetKafka processor on a standalone instance of NiFi, the number of concurrent tasks should match the number of partitions on the Kafka topic. This is not the case (despite what the bulletin tells you when it is started) when the GetKafka processor is running on a NiFi cluster. Let's say you have a 3 node NiFi cluster. Each node in the cluster will pull from a different partition at the same time. So if the topic only has 3 partitions, you will want to leave concurrent tasks at 1 (this indicates 1 thread per NiFi node). If the topic has 6 partitions, set concurrent tasks to 2. If the topic has 4 partitions, I would still use one concurrent task; NiFi will still pull from all partitions, and the additional partition will be included in a round-robin fashion. If you were to set the same number of concurrent tasks as partitions in a NiFi cluster, you will end up with only one node pulling from every partition while your other nodes sit idle.
4. Set your run schedule to 500 ms to reduce excessive CPU utilization. An illustrative set of values is sketched below.
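Purely as an illustration of the recommendations above (assumed scenario: a 3 node NiFi cluster reading a 6 partition topic whose max message size was raised to 5 MB; none of these numbers come from the article), a GetKafka configuration might look something like this:

Batch Size: 10                     (messages combined into each outgoing FlowFile)
Message Demarcator: \n             (default newline)
fetch.message.max.bytes: 5242880   (added property; matches the assumed 5 MB topic limit)
Concurrent Tasks: 2                (3 nodes x 2 tasks = 6 consumers for the 6 partitions)
Run Schedule: 500 ms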
... View more