Member since: 07-30-2019
Posts: 3133
Kudos Received: 1564
Solutions: 909
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 141 | 01-09-2025 11:14 AM |
| | 841 | 01-03-2025 05:59 AM |
| | 430 | 12-13-2024 10:58 AM |
| | 470 | 12-05-2024 06:38 AM |
| | 381 | 11-22-2024 05:50 AM |
02-24-2017
02:30 PM
@Mourad Chahri HBase runs on top of HDFS in HDP. The only service that is part of HDF but not part of HDP is NiFi. NiFi can send data to and retrieve data from HDP HDFS and HBase without both services needing to be installed on the same nodes/hosts.
02-24-2017
12:52 PM
@Mourad Chahri "how to install HDF on the same cluster , because i wanna use HDF and HDP" HDF does not need to be installed on the same hardware as HDP in order to have the software packages send data to one another. For example, HDF NiFi includes the hadoop client libraries needed to send/get data from HDP HDFS. All you need to provide NiFi is the core-sites.xml and HDFS-sites.xml files. No need to install Hadoop (HDFS) clients on the NiFi nodes/hosts or have HDP HDFS installed on the same nodes/hosts. Thanks, Matt
02-24-2017
12:47 PM
@Mourad Chahri Different Ambari servers cannot own the same hosts/nodes. The Ambari agents, which are installed on each node, are configured to communicate with a single Ambari server.
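For illustration, each agent's configuration file (/etc/ambari-agent/conf/ambari-agent.ini) names exactly one server; the hostname below is only a placeholder:
[server]
hostname=ambari-server.example.com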
02-24-2017
12:44 PM
2 Kudos
@Pradhuman Gupta Backpressure has kicked in on your dataflow. Every new connection has a default backpressure object threshold of 10,000 FlowFiles. When backpressure is reached on a connection, the connection is highlighted in red and the backpressure bar (left = object threshold, right = size threshold) shows which threshold has reached 100%. Once backpressure is applied, the component (processor) directly upstream of that connection will no longer run. As you can see in your screenshot above, the "success" connection from your PutSplunk processor is applying backpressure. As a result, the PutSplunk processor is no longer being scheduled to run by the NiFi controller. Since it is no longer executing, FlowFiles began to queue on the connection between your TailFile and PutSplunk processors. Once backpressure kicked in there as well, the TailFile processor was stopped too. If you clear the backpressure on the "success" connection between your PutSplunk and PutEmail processors, your dataflow will start running again. You can adjust the backpressure thresholds by right-clicking on a connection and selecting "Configure". (The configure option is only available if the processors on both sides of the connection are stopped.) In addition to adjusting the backpressure settings, you also have the option of setting a FlowFile expiration on a connection. FlowFile expiration dictates how old a FlowFile in a given connection can be. If the FlowFile has existed in your NiFi (not how long it has been in that specific connection) for longer than the configured time, it is purged from your dataflow. This setting, if set aggressively enough, could help keep your "success" relationship clean enough to avoid backpressure. The connection settings involved are listed below. Thanks, Matt
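For reference, these are the settings on the connection's configuration dialog with their defaults; the values in parentheses are only illustrations of how you might loosen them, not recommendations:
Back Pressure Object Threshold: 10000 (e.g. raise to 50000)
Back Pressure Data Size Threshold: 1 GB (e.g. raise to 5 GB)
FlowFile Expiration: 0 sec (e.g. set to 10 min to purge older FlowFiles)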
02-23-2017
08:41 PM
2 Kudos
@Oliver Meyn You are correct that the Site-To-Site connection and its authorization are handled at the server level and not at the user level. There is no configuration change you can make that would change this behavior. The authorization allows server A to communicate with and send data to server B. Users play no role in the S2S data transfer process. I am not sure how this enhancement would work. Pushing the S2S authorization down to the user level would require adding those users to server B, which may not be desirable. Also, what if server A has a process group containing the RPG that is authorized for many users? Would the expectation be that every one of those users then needs to be added/authorized on server B? I suggest opening an Apache Jira against NiFi to raise additional discussion around this topic. Thanks, Matt
02-23-2017
07:06 PM
5 Kudos
There is a two-part process before any access to the NiFi UI is possible:

1. Authentication: By default, NiFi will use a user/server's SSL certificate, when provided in the connection, to authenticate. When no user/server certificate is presented, NiFi will then look for a Kerberos TGT (if Spnego has been configured in NiFi). Finally, if neither of the above was present in the connection, NiFi will use the login identity provider (if configured). Login identity providers include either LDAP or Kerberos. With both of these options, NiFi will present users with a login screen.

2. Authorization: Authorization is the mechanism that controls which features and components authenticated users are granted access to. The default authorizer NiFi uses is the internal file-based authorizer. There is an option to configure NiFi to use Ranger as the authorizer instead.

The intent of this article is not to discuss how to set up NiFi to use any of the authentication or authorizer options. This article covers how to modify the identity that is passed to the authorizer after any one of the authentication mechanisms succeeds. What is actually passed to the authorizer varies depending on which authentication method is in use:

SSL certificates: Default, always enabled, and always checked first. NiFi uses the full DN from the certificate.

Spnego (Kerberos): Always on when enabled and only used if an SSL certificate was not present in the connection. NiFi uses the full user principal.

ldap-provider (option in login-identity-providers): Always on once configured and only used if both an SSL certificate and a TGT (if Spnego was enabled) are not present in the connection. The default configuration of the ldap-provider will use the full DN returned by LDAP upon successful authentication (USE_DN identity strategy). It can be configured to pass the username used to log in instead (USE_USERNAME identity strategy).

kerberos-provider (option in login-identity-providers): Always on once configured and only used if both an SSL certificate and a TGT (if Spnego was enabled) are not present in the connection. The kerberos-provider will use the user's full principal upon successful authentication.

Whether you choose to use the built-in file-based authorizer or optionally configure your NiFi to use Ranger instead, users must be added and granted various access policies. Adding users by either a full DN or a user's principal can be both annoying and prone to errors, since the authorizer is case sensitive and white spaces are valid characters. This is where NiFi's optional identity mapping configuration comes into play. Identity mapping takes place after successful authentication and before authorization occurs. It gives you the ability to take the value returned by any of the four authentication methods and pass it through one or more mappings to produce a simpler resulting value, which is then passed to your authorizer. The identity mapping properties are configured in NiFi's nifi.properties file, and each mapping you define consists of two parts:

nifi.security.identity.mapping.pattern.<user defined>=
nifi.security.identity.mapping.value.<user defined>=

The mapping pattern takes a Java regular expression as input, with the expectation that one or more capture groups are defined in that expression. One or more of those capture groups are then used in the mapping value to create the desired final result that will be passed to your configured authorizer.

**** Important note: If you are implementing pattern mapping on an existing NiFi cluster that is already running securely, the newly added mappings will be run against the DNs from the certificates created for your nodes and against the Initial Admin Identity value you originally configured. If any of your mappings match, a new value is going to be passed to your authorizer, which means you may lose access to your UI. Before adding any mapping, make sure you have added the new mapped-value users to your NiFi and authorized them so you do not lose access.

By default, NiFi includes 2 example identity mappings commented out in the nifi.properties file; a copy of them is included at the end of this article for reference. You can add as many identity mapping pattern and value pairs as you like to accommodate all your various user/server authentication types. Each must have a unique identifier. In those examples the unique identifiers are "dn" and "kerb". You could add, for example, "nifi.security.identity.mapping.pattern.dn2=" and "nifi.security.identity.mapping.value.dn2=". If you are using Ambari to install and manage your NiFi cluster (HDF 2.x version), you can find the 2 sample identity mapping properties under "Advanced nifi-properties". If you want to add additional mappings beyond those 2 via Ambari, they would be added via the "Custom nifi-properties" config section. Simply click the "Add Property..." link to add your new mappings.

The result of any successful authentication is run through all configured identity mappings until a match is found. If no match is found, the full DN or user principal is passed to the authorizer. Let's take a look at a few examples:

| User/server DN or Principal | Identity Mapping Pattern | Identity Mapping Value | Result passed to authorizer |
|---|---|---|---|
| CN=nifi-server-01.openstacklocal, OU=NIFI | ^CN=(.*?), OU=(.*?)$ | $1 | nifi-server-01 |
| CN=nifi-01, OU=SME, O=mycp, L=Fulton, ST=MD, C=US | ^CN=(.*?), OU=(.*?), O=(.*?), L=(.*?), ST=(.*?), C=(.*?)$ | $1@$2 | nifi-01@SME |
| nifi/instance@MY.COMPANY.COM | ^(.*?)/instance@(.*?)$ | $1@$2 | nifi@MY.COMPANY.COM |
| cn=nifi-user1,ou=SME,dc=mycp,dc=com | ^cn=(.*?),ou=(.*?),dc=(.*?),dc=(.*?)$ | $1 | nifi-user1 |
| JohnDoe@MY.COMPANY.COM | ^(.*?)@(.*?)$ | $1 | JohnDoe |
| EMAILADDRESS=none@none.com, CN=nifi-user2, OU=SME, O=mycp, L=Fulton, ST=MD, C=US | ^EMAILADDRESS=(.*?), CN=(.*?), OU=(.*?), O=(.*?), L=(.*?), ST=(.*?), C=(.*?)$ | $2 | nifi-user2 |

As you can see from the above examples, using NiFi's pattern mapping ability will simplify authorizing new users via either NiFi's default file-based authorizer or Ranger.
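For reference, the two commented-out sample mappings that ship in nifi.properties look roughly like this (taken from a typical HDF 2.x / NiFi 1.x install; your version may differ slightly):
# nifi.security.identity.mapping.pattern.dn=^CN=(.*?), OU=(.*?), O=(.*?), L=(.*?), ST=(.*?), C=(.*?)$
# nifi.security.identity.mapping.value.dn=$1@$2
# nifi.security.identity.mapping.pattern.kerb=^(.*?)/instance@(.*?)$
# nifi.security.identity.mapping.value.kerb=$1@$2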
02-23-2017
02:25 PM
9 Kudos
NiFi works with FlowFiles. Every FlowFile that exists consists of two parts: FlowFile content and FlowFile attributes. While the FlowFile's content lives on disk in the content repository, NiFi holds the "majority" of the FlowFile attribute data in the configured JVM heap memory space. I say "majority" because NiFi swaps attributes to disk on any queue that contains over 20,000 FlowFiles (the default, which can be changed in nifi.properties). Once your NiFi is reporting OutOfMemory (OOM) errors, there is no corrective action other than restarting NiFi. If changes are not made to your NiFi or dataflow, you are surely going to encounter this issue again and again. The default configuration for the JVM heap in NiFi is only 512 MB. This value is set in the bootstrap.conf file: # JVM memory settings
java.arg.2=-Xms512m
java.arg.3=-Xmx512m While the default may work for some dataflows, it is going to be undersized for others.
Simply increasing these values until you stop seeing OOM errors should not be your immediate go-to solution. Very large heap sizes can have adverse impacts on your dataflow as well. Garbage collection takes much longer to run with very large heap sizes, and while garbage collection occurs it is essentially a stop-the-world event. This amounts to a dataflow stoppage for the length of time it takes for the collection to complete. I am not saying that you should never set large heap sizes, because sometimes that is really necessary; however, you should evaluate all other options first. NiFi and FlowFile attribute swapping: NiFi already has a built-in mechanism to help reduce the overall heap footprint. The mechanism swaps FlowFile attributes to disk when a given connection's queue exceeds the configured threshold. These settings are found in the nifi.properties file: nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4

Swapping, however, will not help if your dataflow is so large that queues are holding FlowFiles everywhere but still have not exceeded the threshold for swapping. Any time you decrease the swap threshold, more swapping can occur, which may cost some throughput performance. So here are some other things to check for. Some common reasons for running out of heap memory include:

1. High volume dataflow with lots of FlowFiles active at any given time across your dataflow. (Increase the configured NiFi heap size in bootstrap.conf to resolve.)

2. Creating a large number of attributes on every FlowFile. More attributes equals more heap usage per FlowFile. Avoid creating unused/unnecessary attributes on FlowFiles. (Increase the configured NiFi heap size in bootstrap.conf and/or reduce the configured swap threshold.)

3. Writing large values to FlowFile attributes. Extracting large amounts of content and writing it to an attribute on a FlowFile will result in high heap usage. Try to avoid creating large attributes when possible. (Increase the configured NiFi heap size in bootstrap.conf and/or reduce the configured swap threshold.)

4. Using the MergeContent processor to merge a very large number of FlowFiles. NiFi cannot merge FlowFiles that are swapped, so all of these FlowFiles' attributes must be in heap when the merge occurs. If merging a very large number of FlowFiles is needed, try using two MergeContent processors in series with one another. Have the first merge a maximum of 10,000 FlowFiles and the second merge those already-merged FlowFiles into even larger bundles. (Increasing the configured NiFi heap size in bootstrap.conf also helps.)

5. Using the SplitText processor to split one FlowFile into a very large number of FlowFiles. Swapping of a large connection queue will not occur until after the queue has exceeded the swapping threshold. The SplitText processor will create all the split FlowFiles before committing them to the success relationship, so it is possible to run out of heap memory before all the splits can be created. This is most commonly seen when SplitText is used to split a large incoming FlowFile by every line. Try using two SplitText processors in series, as in the sketch below. Have the first split the incoming FlowFiles into large chunks and the second split them down even further. (Increasing the configured NiFi heap size in bootstrap.conf also helps.)

Note: There are additional processors that can be used for splitting and joining large numbers of FlowFiles, so the same approach as above should be followed for those as well. I only specifically commented on the above since they are more commonly seen being used to deal with very large numbers of FlowFiles.
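As a hypothetical illustration of the two-stage SplitText approach (the line counts are arbitrary and should be tuned to your data and heap size):
SplitText #1 - Line Split Count: 10000 (splits the incoming FlowFile into chunks of up to 10,000 lines each)
SplitText #2 - Line Split Count: 1 (splits each chunk into individual one-line FlowFiles)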
02-23-2017
01:37 PM
1 Kudo
@mayki wogno Every FlowFile that exists consists of two parts: FlowFile content and FlowFile attributes. While the FlowFile's content lives on disk in the content repository, NiFi holds the "majority" of the FlowFile attribute data in the configured JVM heap memory space. I say "majority" because NiFi swaps attributes to disk on any queue that contains over 20,000 FlowFiles (the default, which can be changed in nifi.properties). Some common reasons for running out of heap memory include:

1. High volume dataflow with lots of FlowFiles active at any given time across your dataflow. (Increase the configured NiFi heap size in bootstrap.conf to resolve.)

2. Creating a large number of attributes on every FlowFile. More attributes equals more heap usage per FlowFile. (Increase the configured NiFi heap size in bootstrap.conf and/or reduce the configured swap threshold.)

3. Writing large values to FlowFile attributes. Extracting large amounts of content and writing it to an attribute on a FlowFile will result in high heap usage. Try to avoid creating large attributes when possible. (Increase the configured NiFi heap size in bootstrap.conf and/or reduce the configured swap threshold.)

4. Using the MergeContent processor to merge a very large number of FlowFiles. NiFi cannot merge FlowFiles that are swapped, so all of these FlowFiles' attributes must be in heap when the merge occurs. If merging a very large number of FlowFiles is needed, try using two MergeContent processors in series with one another. Have the first merge a maximum of 10,000 FlowFiles and the second merge those already-merged FlowFiles into even larger bundles. (Increasing the configured NiFi heap size in bootstrap.conf also helps.)

5. Using the SplitText processor to split one FlowFile into a very large number of FlowFiles. Swapping of a large connection queue will not occur until after the queue has exceeded the swapping threshold. The SplitText processor will create all the split FlowFiles before committing them to the success relationship, so it is possible to run out of heap memory before all the splits can be created. This is most commonly seen when SplitText is used to split a large incoming FlowFile by every line. Try using two SplitText processors in series. Have the first split the incoming FlowFiles into large chunks and the second split them down even further. (Increasing the configured NiFi heap size in bootstrap.conf also helps.)

Thanks, Matt
02-23-2017
01:12 PM
1 Kudo
@Ramakrishnan V You will need to use the following curl command to obtain a token for your LDAP user:
curl 'https://<hostname>:<port>/nifi-api/access/token' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' --data 'username=admin&password=admin' --compressed --insecure Once you have your token, you will need to pass it as the bearer of all subsequent curl commands you execute against the NiFi API by adding the following to them: -H 'Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJjbj1hZG1pbixkYz1leGFtcGxlLGRjPW9yZyIsImlzcyI6IkxkYXBQcm92aWRlciIsIm
F1ZCI6IkxkYXBQcm92aWRlciIsInByZWZlcnJlZF91c2VybmFtZSI6ImFkbWluIiwia2lkIjoxLCJleHAiOjE0ODcxNDM2OTEs
ImlhdCI6MTQ4NzEwMDQ5MX0.GwwJ0Yz4_KXUAMNIH500jw8YcIk3e6ZdcT3LCrrkHjc' The long string above is an example of the token you will get back from the first command; a sample authenticated request is sketched below. Thanks, Matt
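For example, a follow-up call with the token might look like this (the /flow/status endpoint is used only as an illustration; substitute whichever nifi-api resource you need):
curl 'https://<hostname>:<port>/nifi-api/flow/status' --compressed --insecure -H 'Authorization: Bearer <token returned by the first command>'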
02-22-2017
10:08 PM
1 Kudo
@Joe Petro Yes, this is very doable... NiFi automatically creates a FlowFile attribute called "filename" on every FlowFile that is created. You can use this existing attribute, via the NiFi Expression Language, to specify the target HDFS directory (see the sketch below). Of course you will want to modify it for the complete target path. Thanks, Matt
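As a hypothetical example, the PutHDFS processor's "Directory" property (which supports the NiFi Expression Language) could reference the attribute like this, where /data/landing is only a placeholder prefix:
Directory: /data/landing/${filename}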