Member since: 07-30-2019
Posts: 2445
Kudos Received: 1284
Solutions: 689
09-06-2016
12:29 PM
1 Kudo
@Bojan Kostic
It is not currently possible to add new jars/nars to a running NiFi. A restart is always required to get these newly added items loaded. Upon NiFi startup, all the jars/nars are unpacked into the NiFi work directory. To maintain high availability it is recommended that you use a NiFi cluster. This allows you to do rolling restarts so that your entire cluster is not down at the same time. If you are adding new components as part of this rolling update, you will not be able to use those new components until all nodes have been updated. Thanks, Matt
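As a rough sketch of the deployment steps (paths assume a default installation layout; adjust for your environment):
1. Copy the new .nar file into <nifi-install>/lib, or into a directory pointed at by nifi.nar.library.directory in nifi.properties.
2. Restart that NiFi instance (for example with bin/nifi.sh restart).
3. In a cluster, repeat this one node at a time so the rest of the cluster stays up; the new components become usable once every node has been restarted.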
... View more
09-06-2016
12:18 PM
2 Kudos
@David DN Before Site-to-Site (S2S) can be used, the following properties must be set in the nifi.properties file on all the Nodes in your NiFi cluster:
# Site to Site properties
nifi.remote.input.host=<FQDN of host> <-- Set to an FQDN resolvable by all Nodes
nifi.remote.input.secure=false <-- Set to true if NiFi is running HTTPS
nifi.remote.input.socket.port=<Port used for S2S> <-- Must be set to enable RAW S2S
nifi.remote.input.http.enabled=true <-- Set if you want to support HTTP transport
nifi.remote.input.http.transaction.ttl=30 sec
A restart of your NiFi instances will be necessary for this change to take effect.
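For example, on a node whose resolvable hostname is nifi-node1.example.com (hostname and port here are only placeholders), the resulting entries might look like:
# Site to Site properties
nifi.remote.input.host=nifi-node1.example.com
nifi.remote.input.secure=false
nifi.remote.input.socket.port=10000
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec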
Matt
... View more
09-02-2016
02:02 PM
@INDRANIL ROY Please share how you have your SplitText and RouteText processors configured. If I understand your end goal, you want to take this single file with 10,000,000 entries/lines and route only the lines meeting criteria 1 to one PutHDFS, while routing all other lines to another PutHDFS? Thanks, Matt
... View more
08-31-2016
08:48 PM
You can also save portions or all of your dataflow as NiFi templates that can be exported for use on other NiFi installations. To create a template, simply highlight all the components you want in your template (if you highlight a process group, all components within that process group will be added to the template). Then click on the "create template" icon in the upper middle of the UI to create your template. The Template Management UI can be used to export and import these templates from your NiFi. It can be accessed via the icon in the upper right corner of the NiFi UI.
*** Note: NiFi templates are sanitized of any sensitive property values (a sensitive property value is any value that would be encrypted; in NiFi that means any passwords).
Matt
... View more
08-31-2016
08:41 PM
1 Kudo
@Sami Ahmad Every change you make to the NiFi canvas is immediately saved to the flow.xml.gz file. There is no need to manually initiate a save. Each installation of NiFi provides you with a single UI for building a dataflow. You can build as many different dataflows as you want on this canvas. These different dataflows do not need to be connected in any way. The most common approach to what you are doing is to create a different process group for each of your unique dataflows. To add a new process group to the canvas, drag the process group icon onto the canvas and give it a unique name that identifies the dataflow it contains. If you double click on that process group, you will enter it, giving you a blank canvas to work with.
So here you can see I have two process groups that are not connected in any way. One contains a dataflow that consists of 6 processors while the other has 89. I can right click on either of these process groups and select either start or stop from the context menu. That start or stop action is applied to every processor within that process group, so this gives you an easy way to stop one dataflow and start another. You could even have both running at the same time. Matt
... View more
08-31-2016
06:24 PM
1 Kudo
@Sami Ahmad An easy way to return NiFi to a blank canvas is to simply stop NiFi and remove the flow.xml.gz file from NiFi's conf directory. When you restart your NiFi, a new blank flow.xml.gz file will be generated. Any FlowFiles that existed in the deleted flow will be purged from NiFi when it is started. Alternatively: The error you are seeing occurs because you are inside a NiFi process group and trying to delete all the components; however, NiFi has detected that there are connections attached to that process group from the process group one level up. NiFi will not allow those components to be removed until their feeding connections are removed first. If you return to the root/top level of your NiFi dataflow, you can select the connections entering and exiting the process group and delete them. Once they have been deleted, you can select the process group itself and delete it. This will in turn delete all components inside that process group. The deletion of a connection will only be allowed if there are no queued FlowFiles in that connection. If there are queued FlowFiles, they must be purged before the connection can be deleted. Matt
... View more
08-31-2016
05:51 PM
1 Kudo
@Sami Ahmad I am not saying it will not work as is; however, without a defined path forward from those two output ports, the data will just queue.
Looking at your screenshot, it does look like the dataflow is producing FlowFiles. If you look at the stats on the process groups "log Generator" and "Data Enrichment" you will see data being produced and queued. The problem is that none of the components inside "Data Enrichment" are running. If you double click on the "Data Enrichment" process group, you will be taken inside of it. There you will see the stopped components, the invalid components, and the ~20,000 queued FlowFiles.
You will need to start all the valid stopped components in this process group to get your data flowing all the way to your two putHDFS processors outside this process group.
There are two output ports in this "Data Enrichment" process group that are invalid. They are not necessary for this tutorial to work. I suggest you stop the "Filter WARN" and "Filter INFO" processors and delete the connections feeding these invalid output ports. If you have already run this flow and data is queued on the connections you wish to delete, you will need to right click on each connection and select "Empty queue" before you will be able to delete it. These examples were not put out by Apache. I will try to find the person who wrote this tutorial and see if I can get them to update it. Thanks, Matt
... View more
08-31-2016
04:56 PM
1 Kudo
@Sami Ahmad For starters, I have to agree that the "generate_logs.py" script is not used anywhere in that NiFi template. The NiFi flow itself has been built to generate some fake log data. Invalid components: Components like NiFi processors, input ports, output ports, controller services, and reporting tasks all have minimum defined requirements that must be met before they are in a "valid" state. Only components that are in a valid state can be started. Floating your cursor over the invalid icon on a component will show why it is not valid. The Data Enrichment process group in this template has two output ports that have no defined connections, making them invalid. Despite the warning you were presented with, all valid components should have been started. You can fix the issue by creating the two missing output connections:
So here you can see I added a new processor (UpdateAttribute with success checked for auto-terminate) and dragged a connection from the Data Enrichment process group to it twice, once for each invalid output port it contained (Warn logs and Info logs). Now the process group no longer reports any invalid components in it. I am unable to see the screenshot you attached. I did start the "Log Generator" process group without making any changes to it and do see data being produced. I see data being queued in several places in the dataflow. If you are not seeing any data queued, check your NiFi's nifi-app.log for errors. Also check the various GenerateFlowFile processors to see if they are producing any bulletins (a bulletin icon will be displayed on the processor if so). Floating over the bulletin will display a log message that may indicate the issue. Thanks, Matt
... View more
08-31-2016
03:15 PM
@Sami Ahmad Instead of left clicking on the link in step three, right click and select the "Save Link As..." option to save the xml template so it can be imported into your NiFi. The dataflow template will show you all the components needed for this workflow. I believe the intent of this tutorial was not to teach users how to use the NiFi UI, but rather how to use a combination of specific NiFi components to accomplish a particular workflow. Using the NiFi UI dataflow tools, you can recreate the workflow as a UI dataflow building exercise. Thanks, Matt
... View more
08-31-2016
01:28 PM
1 Kudo
@INDRANIL ROY
That is the exact approach I suggested in response to the above thread we had going. Each Node will only work on the FlowFiles it has in its possession. By splitting this large TB file into many smaller files, you can distribute the processing load across your downstream cluster.
The distribution of FlowFiles via the RPG works as follows. The RPG communicates with the NCM of your NiFi cluster. The NCM returns to the source RPG a list of the available Nodes in its cluster and their S2S ports, along with the current load on each. It is then the responsibility of the RPG to do smart load-balancing of the data in its incoming queue to these Nodes. Nodes with higher load will get fewer FlowFiles. The load balancing is done in batches for efficiency, so under light load you may not see an exactly balanced delivery, but under higher FlowFile volumes you will see a balanced delivery in the 5-minute delivery statistics. Thanks, Matt
... View more
08-31-2016
12:26 PM
NiFi 1.x was just officially released yesterday. HDF 2.x has not been released yet (look for it soon). The article from @Jobin George is still valid for the NiFi 0.x (HDF 1.x) versions. A new article should be written for the new versions.
... View more
08-31-2016
12:23 PM
3 Kudos
@David DN NiFi 1.x (HDF 2.x) versions have gone through a major framework upgrade/change. A multi-tenancy approach has been added that allows access to be controlled down to the component level. As part of this change, the way the initial admin user is added has changed. In previous NiFi 0.x (HDF 1.x) versions, this was simply done by adding the DN of your first admin user to the authorized-users.xml file. In NiFi 1.x (HDF 2.x) versions, you need to set that user DN in the following property in the authorizers.xml file:
<property name="Initial Admin Identity"></property>
For those who previously worked with NiFi 0.x (HDF 1.x) versions, you can use an old authorized-users.xml file to seed the new NiFi version's user authorization by setting this property in the same file:
<property name="Legacy Authorized Users File"></property>
NiFi 1.x (HDF 2.x) versions no longer provide new users the ability to "request access". An admin will need to manually add each user and assign them component-level access through the UI.
Adding new users is done through the Users UI found in the hamburger menu in the upper right corner of the UI. (Remember this can only be done once the initial admin has been given access as described above.) From the Users UI, select the add user icon in the upper right corner. The add-user dialog will appear; supply your Kerberos, LDAP, or certificate DN and click "OK". Now that you have added a user, you need to grant them component-level access back on the main NiFi UI. Select the component you wish to control access to; in the example below we select the root canvas. An "Access Policies" UI will appear where you select the access policy you want to add the user to from the pull-down menu. Once you select the policy, click on the add user icon in the upper right to grant access to one of the users added earlier. Thanks, Matt
... View more
08-31-2016
11:42 AM
1 Kudo
@boyer NiFi 0.x versions use a whole-dataflow revision number when applying changes anywhere on the canvas. In order to invoke a change anywhere on the canvas (it does not matter if the users are working on different components or within different process groups), the user making the change needs the latest revision number. A user may open a component for editing, at which time the current revision number is grabbed. At the same time, another user in another browser may do the same. Whichever user makes their change and hits apply first will trigger the revision number to increment. When the second user hits apply, they get the error you described because their change request does not have the current revision. But there is good news... how this works has changed in NiFi 1.x (HDF 2.x) versions. Revisions are no longer tied to the entire dataflow. While two users will still be unable to make changes to the exact same component at the same time, they will be able to edit different components at the same time without running into the above issue. Thanks, Matt
... View more
08-30-2016
08:54 PM
@Saikrishna Tarapareddy
Just want to make sure I understand completely.
You can establish a connection from your local machine out to your remote NiFi; however, you cannot have your remote NiFi connect to your local machine. Correct?
In this case you would install a NiFi instance on your local machine, and the Remote Process Group (RPG) would be added to the canvas of that local NiFi instance. The NiFi instance running the RPG acts as the client in the connection between NiFi instances. On your remote NiFi instance, the dataflow that is fetching files from your HDFS would need to route those files to an output port located at the root canvas level. (Output and input ports allow FlowFiles to transfer up one level in the dataflow, so at the root level they allow you to interface with another NiFi.)
For this transfer to work, your local instance of NiFi will need to be able to communicate with the http(s) port of your remote NiFi instance (the NCM http(s) port if the remote is a NiFi cluster). Your local instance will also need to be able to communicate with the configured Site-To-Site (S2S) port on your remote instance (it needs to be able to reach the S2S port on every Node if the remote is a NiFi cluster). In the remote instance's nifi.properties file:
# Site to Site properties
nifi.remote.input.socket.host=<remote instance FQDN>
nifi.remote.input.socket.port=<S2S port number>
The dataflow on your remote NiFi would look something like this: The dataflow on your local NiFi would look something like this: As you can see in this setup, the local NiFi establishes the connection to the remote NiFi and pulls the data from the output port "outLocal". Thanks,
Matt
... View more
08-29-2016
09:01 PM
@Saikrishna Tarapareddy Your Regex above says the CSV file content must start with Tagname,Timestamp,Value,Quality,QualityDetail,PercentGood
So it should not route to "Header" unless the CSV starts with that; what is found later in the CSV file should not matter. I tried this and it seems to work as expected. If I removed the '^', then all files matched. Your processor is also loading 1 MB worth of the CSV content for evaluation; however, the string you are searching for is far fewer bytes. If you only want to match against the first line, reduce the size of the buffer from '1 MB' to maybe '60 B'. When I changed the buffer to '60 B' and removed the '^' from the regex above, only the files with the matching header were routed to "header".
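For example, a RouteOnContent configuration along these lines (values here are illustrative) should route only files whose first line is that header:
Match Requirement: content must contain match
Content Buffer Size: 60 B
header (dynamic property defining the routing relationship): ^Tagname,Timestamp,Value,Quality,QualityDetail,PercentGood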
Thanks, Matt
... View more
08-29-2016
06:47 PM
2 Kudos
@Saikrishna Tarapareddy The MergeContent processor is not designed to look at the content of the NiFi FlowFiles it is merging. What you will want to do first is use a RouteOnContent processor to route only those FlowFiles whose content contains the header you want to merge on. The 'unmatched' FlowFiles can then be routed elsewhere or auto-terminated.
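A rough outline of that flow (processor placement only; the regex in the 'header' property is whatever identifies your header line):
[source processor] -> RouteOnContent
    header (matched) -> MergeContent -> downstream processing
    unmatched -> auto-terminate or route elsewhere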
Thanks, Matt
... View more
08-26-2016
12:00 PM
3 Kudos
@kishore sanchina NiFi only supports user-controlled access when it is configured to run securely over HTTPS. The HTTPS configuration of NiFi requires that a keystore and truststore be created/provided. If you don't have a corporately provided PKI infrastructure that can issue TLS certificates for this purpose, you can create your own. The following HCC article will walk you through manually creating your own: https://community.hortonworks.com/articles/17293/how-to-create-user-generated-keys-for-securing-nif.html Once your NiFi is set up securely, you will need to enable user access to the UI. There are two parts to successful access:
1. User authentication <-- This can be accomplished via TLS certificates, LDAP, or Kerberos. Setting up NiFi to use one of these login identity providers is covered here: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#user-authentication
2. User authorization <-- This is accomplished in NiFi via the authorized-users.xml file. The process is documented here: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#controlling-levels-of-access
You will need to manually populate the authorized-users.xml file with your first "Admin" role user. That admin user will then be able to approve access for other users who have passed the authentication phase and submitted a UI-based authorization request. Thanks, Matt
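For reference, a minimal sketch of the nifi.properties entries involved in enabling HTTPS (hostnames, ports, paths, and passwords are placeholders for your own values):
nifi.web.https.host=nifi-host.example.com
nifi.web.https.port=9443
nifi.security.keystore=/opt/nifi/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=<keystore password>
nifi.security.keyPasswd=<key password>
nifi.security.truststore=/opt/nifi/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=<truststore password>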
... View more
08-25-2016
08:41 PM
1 Kudo
@INDRANIL ROY
NiFi does not distribute the processing of a single file across multiple Nodes in a NiFi cluster. Each Node works on its own set of files. The Nodes themselves are not even aware other nodes exist; they work on the files they have and report their health and status back to the NiFi Cluster Manager (NCM).
1. What format is this file in?
2. What kind of processing are you trying to do against this file's content?
3. Can the file be split into numerous smaller files (depending on the file content, NiFi may be able to do the splitting)?
As an example: a common dataflow involves processing very large log files. The large log file is processed by the SplitText processor to produce many smaller files. These smaller files are then distributed across a cluster of NiFi nodes where the remainder of the processing is performed. There are a variety of pre-existing "split" type processors. Thanks, Matt
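As a rough illustration (values are placeholders), a SplitText configured with:
Line Split Count: 10000
Header Line Count: 0
would turn a single 10,000,000-line file into roughly 1,000 FlowFiles of 10,000 lines each, which can then be spread across the cluster for downstream processing.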
... View more
08-25-2016
02:55 PM
4 Kudos
@kishore sanchina The simplest answer to your question is to use the ListFile processor to produce a listing of the files from your local filesystem, feed that to a FetchFile processor that will pick up the content, and then pass them to a PutHDFS processor to send them to your HDFS. The ListFile processor maintains state based on the lastModified time of the files to ensure the files are not listed more than once. If you right click on either of these NiFi processors you can select "usage" from the displayed context menu to get more details on the configuration of each. Thanks, Matt
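A minimal sketch of that three-processor flow and its key properties (paths are placeholders for your environment):
ListFile -- Input Directory: /data/incoming
FetchFile -- File to Fetch: ${absolute.path}/${filename} (the default)
PutHDFS -- Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml ; Directory: /landing/incoming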
... View more
08-25-2016
02:00 PM
@INDRANIL ROY Given the massive size of your file, ListSFTP/FetchSFTP may not be the best approach. Let me ask a few questions:
1. Are you picking up numerous files of this multi-TB size, or are we talking about a single file?
2. Are you trying to send the same TB file to every Node in your cluster, or is each node going to receive a completely different file?
3. Is the directory where these files are originally consumed from a local disk or a network-mounted disk?
... View more
08-24-2016
03:43 PM
1 Kudo
Just to clarify how S2S works when communicating with a target NiFi cluster: the NCM never receives any data, so it cannot act as the load-balancer. When the source NiFi communicates with the NCM, the NCM returns a list of all currently connected nodes and their S2S ports, along with the current load on each node, to the source NiFi. It is then the job of the source NiFi RPG to use that information to do a smart, load-balanced delivery of data to those nodes.
... View more
08-24-2016
03:04 PM
Anything you can do via the browser can be done by making calls to the NiFi API. You could either set up an external process to run a couple of curl commands to start and then stop the GetTwitter processor in your flow, or you could use a couple of InvokeHTTP processors in your dataflow (configured using the cron scheduling strategy) to start and stop the GetTwitter processor on a given schedule. Matt
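As a rough sketch of the in-flow approach (the exact REST endpoint and request body depend on your NiFi version, so treat these as placeholders and confirm them against the NiFi REST API documentation for your release):
InvokeHTTP "start GetTwitter" -- Scheduling Strategy: CRON driven (e.g. 0 0 8 * * ?); HTTP Method: PUT; Remote URL: http://<nifi-host>:<port>/nifi-api/... (the resource for your GetTwitter processor's id); request body sets the processor state to RUNNING.
A second InvokeHTTP with a different cron expression issues the matching request to set the state back to STOPPED.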
... View more
08-24-2016
02:14 PM
1 Kudo
@INDRANIL ROY What you describe is a very common dataflow design. I have a couple of questions for clarity. RPGs (Remote Process Groups) do not send to other RPGs. RPGs send to and pull data from input and output ports located on other NiFi instances. I suspect your standalone instance has the RPG and it is sending FlowFiles to input port(s) on the destination NiFi cluster.
In this particular case the load-balancing of data is being handled by the RPG. For network efficiency, data is distributed in batches, so with light dataflows you may not see the exact same number of FlowFiles going to each Node. The load-balancing also has logic built in so that Nodes in the target cluster with a lighter workload get more FlowFiles. Although the URL provided to the RPG is the URL for the target NiFi cluster's NCM, the FlowFiles are not sent to the NCM but rather directly to the connected nodes in the target cluster. Every Node in a NiFi cluster operates independently of the others, working only on the FlowFiles it possesses. Nodes do not communicate with one another; they simply report their health and status back to the NCM. It is information from those health and status heartbeats that is sent back to the source RPG and used by that RPG to do the smart data delivery.
In order to distribute the fetching of the source data, the source directory would need to be reachable by all nodes in the target NiFi cluster. In the case of ListFile/FetchFile, the directory would need to be mounted identically on all systems. Another option would be to switch to a ListSFTP/FetchSFTP setup. In this setup you would not even need your standalone NiFi install. You could simply add a ListSFTP processor to your cluster (configured to run "On Primary Node"), then take the success from that listing and feed it to an RPG that points back at the cluster's own NCM URL. An input port would be used to receive the now load-balanced FlowFiles. Feed the success from that input port to the FetchSFTP processor, and now you have all nodes in your cluster retrieving the actual content. So as you can see from the above, the ListSFTP would only run on one node (the Primary Node), producing zero-content listing FlowFiles. The RPG would smartly distribute those FlowFiles across all connected nodes, where the FetchSFTP on each Node would retrieve the actual content. The same flow could be built with ListFile and FetchFile as well; just mount the same source directory on every node and follow the same model. Matt
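A rough outline of that cluster-only flow:
ListSFTP (Scheduling Strategy: On Primary Node) -> Remote Process Group (pointing at this same cluster's NCM URL)
Input Port (root canvas level) -> FetchSFTP -> remainder of the processing flow
The listing runs once on the primary node, the RPG load-balances the zero-content listing FlowFiles across all connected nodes, and each node then fetches and processes its own share of the content.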
... View more
08-24-2016
12:53 PM
Most threads are very short running (milliseconds), and since the NiFi stats refresh rate defaults to every 30 seconds, the number in the upper right corner may not represent a still-running thread. In your screenshot above, the TailFile processor shows as having recorded the completion of 473,336 tasks (each task using a thread to complete) and a total cumulative thread time of only 2 minutes, 52 seconds, and 334 milliseconds over the past 5 minutes. Long-running threads will show much different stats in the Tasks/Time field.
... View more
08-17-2016
08:33 PM
1 Kudo
@Hans Feldmann The individual processors allow for concurrent task changes. By default they all have one concurrent task. For each additional concurrent task, you are giving that processor the opportunity to request an additional thread from the NiFi controller to do work in parallel (think of it as two copies of the same processor working on different files or batches of files). If there aren't sufficient FlowFiles in the incoming queue, any additional concurrent tasks are not utilized. The flip side is that if you allocate too many concurrent tasks to a single processor, that processor may end up using too many threads from the NiFi controller's resource pool, resulting in thread starvation for other processors. So start with the default and step up one increment at a time at the points of backlog in your flow. The NiFi controller also has a setting that limits the maximum number of threads it can use from the underlying hardware. This is the other thing Andrew was mentioning. A restart of NiFi is NOT needed when you make changes to these values. The defaults are low (10 timer driven and 5 event driven). I would set the timer driven count to no more than double the number of cores your hardware has. Thanks, Matt
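As a worked example of that guideline (numbers are illustrative only): on an 8-core machine you might raise the Maximum Timer Driven Thread Count from the default of 10 to at most 16, and only go higher if you move to hardware with more cores.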
... View more
08-15-2016
05:15 PM
@Yogesh Sharma You are seeing duplicate data because the run schedule on your InvokeHTTP processor is set to 1 sec and the data you are pulling is not updated that often.
You can build into your flow the ability to detect duplicates (even across a NiFi cluster). In order to do this you will need the following things set up:
1. DistributedMapCacheServer (add this controller service to the "Cluster Manager" if clustered; if standalone it still needs to be added. This is configured with a listening port.)
2. DistributedMapCacheClientService (add this controller service to "Node" if clustered; if standalone it still needs to be added. This is configured with the FQDN of the NCM running the above cache server.)
3. Start the above controller services.
4. Add a HashContent and a DetectDuplicate processor to your flow between your InvokeHTTP processor and the SplitJson processor.
I have attached a modified version of your template.
eqdataus-detectduplicates.xml
If you still see duplicates, adjust the configured Age Off Duration in the DetectDuplicate processor.
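A rough sketch of the key settings in the two added processors (names follow the processors' usage docs; adjust values for your data rates):
HashContent -- Hash Attribute Name: hash.value
DetectDuplicate -- Cache Entry Identifier: ${hash.value}; Distributed Cache Service: the DistributedMapCacheClientService from step 2; Age Off Duration: e.g. 1 hour
FlowFiles routed to 'duplicate' can be auto-terminated, while 'non-duplicate' continues on to the SplitJson processor.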
Thanks, Matt
... View more
08-15-2016
12:40 PM
2 Kudos
@Obaid Salikeen Not sure what "issues" you had when you tried to add a new node to your existing cluster. The components (processors, connections, etc.) of an existing cluster can be running when you add additional nodes to it. The new nodes will inherit the flow and templates from the NCM, as well as the current running state of those components, when they join.
But, in order for a node to successfully join a cluster the following must be true:
1. The new node either has no flow.xml.gz file and templates directory, or its flow.xml.gz file and templates match what is currently on the NCM. (If they differ, remove the flow.xml.gz file and templates dir from the new node and restart the node.) The nifi-app.log will indicate if a difference was found.
2. The nifi.sensitive.props.key= in the nifi.properties file must have the same value as on the NCM.
3. The NCM must be able to resolve the URL to the new node. If nifi.web.http(s).host= was left blank on your new node, Java on that node may be reporting the hostname as localhost. Make sure valid resolvable hostnames are supplied for nifi.web.http.host=, nifi.cluster.node.address=, and nifi.cluster.node.unicast.manager.address=.
4. Both NCM and Node security protocols must match: nifi.cluster.protocol.is.secure= in the nifi.properties file.
5. Firewalls must be open between the NCM and the Node on both the HTTP(s) ports and the node and NCM cluster protocol ports.
6. The new node must have all the same available Java classes. If custom processors exist in your flow, make sure the new node also has those custom nar/jar files included in its lib dir.
Thanks, Matt
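For reference, a minimal sketch of the node-side nifi.properties values involved (hostnames and ports are placeholders; use values that match your NCM and environment):
nifi.web.http.host=node3.example.com
nifi.sensitive.props.key=<same value as on the NCM>
nifi.cluster.is.node=true
nifi.cluster.node.address=node3.example.com
nifi.cluster.node.protocol.port=<node protocol port>
nifi.cluster.node.unicast.manager.address=<NCM FQDN>
nifi.cluster.node.unicast.manager.protocol.port=<NCM protocol port>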
... View more
08-05-2016
09:20 PM
1 Kudo
@Saikrishna Tarapareddy
Were all 74 files in the input queue before the MergeContent processor ran?
The MergeContent processor, just like the other processors, works on a run schedule. My guess is that the last file was not in the queue at the moment the MergeContent processor ran, so you only saw 13 get bundled instead of 14. With a min of 4 entries, it will read what is on the queue and bin it. You likely ended up with 3 bins of 20 and 1 bin of 13 because, at the moment it looked at the queue, 73 FlowFiles is all it saw.
You can confirm this by stopping the MergeContent processor and allowing all 74 files to queue before starting it. The behavior should then be as you expect.
Sounds like it is not important to have exactly 20 per merged file. Perhaps you can set a max bin age so that files don't get stuck.
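One hedged example of settings that would smooth this out (values are illustrative):
Minimum Number of Entries: 20
Maximum Number of Entries: 20
Max Bin Age: 5 min
With a Max Bin Age set, a bin that never reaches 20 entries is still merged and released once it has sat for 5 minutes.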
Something else you can do is adjust the run schedule so the MergeContent does not run as often. The default is "0 sec", which means run as fast as possible. Try changing that to somewhere between 1 and 10 sec to give the files a chance to queue. If you are picking up all 74 files at the same time, we are likely talking about milliseconds here causing that last file to get missed. Thanks, Matt
... View more
08-05-2016
02:48 PM
The attached images do not really show us your complete configuration. Can you generate a template of your flow through the NiFi UI and share that? You create a template by highlighting/selecting all the components you want to include in your template and then clicking on the "create template" icon in the upper center of the UI. After the template has been created, you can export it out of your NiFi from the Template Management UI icon (upper right corner of the UI). Then attach that exported xml template here.
... View more
08-05-2016
01:50 PM
With a NiFi cluster, every node in that cluster runs the exact same dataflow. Some data-ingest type processors are not ideally suited for this, as they may compete for or pull the same data into each cluster node. In cases like this it is better to set the scheduling strategy on these processors to "On Primary Node" so that the processor only runs on one node (the primary node).
You can then use dataflow design strategies like RPGs (NiFi Site-to-Site) to redistribute the received data across all your NiFi cluster nodes for processing.
... View more