Member since 07-30-2019 | 2455 Posts | 1285 Kudos Received | 692 Solutions
08-24-2016
03:43 PM
1 Kudo
Just to clarify how S2S works when communicating with a target NiFi cluster: the NCM never receives any data, so it cannot act as the load-balancer. When the source NiFi communicates with the NCM, the NCM returns a list of all currently connected nodes, their S2S ports, and the current load on each node. It is then the job of the source NiFi RPG to use that information to do a smart, load-balanced delivery of data to those nodes.
08-24-2016
03:04 PM
Anything you can do via the browser can be done by making calls to the NiFi API. You could either set up an external process to run a couple of curl commands to start and then stop the GetTwitter processor in your flow, or you could use a couple of invokeHTTP processors in your dataflow (configured using the cron scheduling strategy) to start and stop the GetTwitter processor on a given schedule (see the example calls below). Matt
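As a rough illustration, the external curl approach might look like the following against a NiFi 1.x-style REST API (the host, processor ID, and revision version are placeholders to substitute from your own instance; 0.x releases expose a different endpoint layout):

# Start the GetTwitter processor (ID and revision are placeholders)
curl -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"version":1},"state":"RUNNING"}' \
  http://nifi-host:8080/nifi-api/processors/<processor-id>/run-status

# Stop it again later
curl -X PUT -H 'Content-Type: application/json' \
  -d '{"revision":{"version":2},"state":"STOPPED"}' \
  http://nifi-host:8080/nifi-api/processors/<processor-id>/run-status

The revision version must match the processor's current revision, which you can read back with a GET on /nifi-api/processors/<processor-id>.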
08-24-2016
02:14 PM
1 Kudo
@INDRANIL ROY What you describe is a very common dataflow design. I have a couple of questions for clarity. RPGs (Remote Process Groups) do not send to other RPGs. RPGs send data to and pull data from input and output ports located on other NiFi instances. I suspect your standalone instance has the RPG and it is sending FlowFiles to input port(s) on the destination NiFi cluster. In this particular case the load-balancing of data is being handled by the RPG. For network efficiency data is distributed in batches, so with light dataflows you may not see exactly the same number of FlowFiles going to each node. The load-balancing also has logic built into it so that nodes in the target cluster with a lighter workload get more FlowFiles.

Although the URL provided to the RPG is the URL for the target NiFi cluster's NCM, the FlowFiles are not sent to the NCM, but rather sent directly to the connected nodes in the target cluster. Every node in a NiFi cluster operates independently of the others, working only on the FlowFiles it possesses. Nodes do not communicate with one another. They simply report their health and status back to the NCM. It is information from those health and status heartbeats that is sent back to the source RPG and used by that RPG to do the smart data delivery.

In order to distribute the fetching of the source data, the source directory would need to be reachable by all nodes in the target NiFi cluster. In the case of ListFile/FetchFile, the directory would need to be mounted identically on all systems. Another option would be to switch to a ListSFTP/FetchSFTP setup. In this setup you would not even need your standalone NiFi install. You could simply add a ListSFTP processor to your cluster (configured to run "on primary node"). Then take the success from that listing and feed it to an RPG that points back at the cluster's own NCM URL. An input port would be used to receive the now load-balanced FlowFiles. Feed the success from that input port to the FetchSFTP processor and now you have all nodes in your cluster retrieving the actual content.

So as you can see from the above, the ListSFTP would run on only one node (the primary node), producing FlowFiles that carry attributes but no content. The RPG would smartly distribute those FlowFiles across all connected nodes, where the FetchSFTP on each node would retrieve the actual content. The same flow could be built with ListFile and FetchFile as well; just mount the same source directory on every node and follow the same model. Matt
08-24-2016
12:53 PM
Most threads are very short running (milliseconds), and the NiFi UI refresh rate defaults to every 30 seconds, so the number in the upper right corner may not represent a still-running thread. In your screenshot above, the TailFile processor shows as having recorded the completion of 473,336 tasks (each task using a thread to complete) with a total cumulative thread time of only 2 minutes, 52 seconds, and 334 milliseconds over the past 5 minutes. Long-running threads will show much different stats in the Tasks/Time field.
08-17-2016
08:33 PM
1 Kudo
@Hans Feldmann The individual processors allow for concurrent task changes. By default they all have one concurrent task. For each additional concurrent task, you are giving that processor the opportunity to request an additional thread from the NiFi controller to do work in parallel. (Think of it as two copies of the same processor working on different files or batches of files.) If there aren't sufficient files in the incoming queue, any additional concurrent tasks are not utilized. The flip side is that if you allocate too many concurrent tasks to a single processor, that processor may end up using too many threads from the NiFi controller's resource pool, resulting in thread starvation for other processors. So start with the default and step up by one increment at a time at the points of backlog in your flow. The NiFi controller also has a setting that limits the maximum number of threads it can use from the underlying hardware. This is the other thing Andrew was mentioning. A restart of NiFi is NOT needed when you make changes to these values. The defaults are low (10 timer driven and 5 event driven). I would set the timer driven value to no more than double the number of cores your hardware has. Thanks, Matt
08-15-2016
05:15 PM
@Yogesh Sharma You are seeing duplicate data because the run schedule on your invokeHTTP processor is set to 1 sec and the data you are pulling is not updated that often.
You can build into your flow the ability to detect duplicates (even across a NiFi cluster). In order to do this you will need the following things set up (see the configuration sketch below):
1. DistributedMapCacheServer (Add this controller service to the "Cluster Manager" if clustered. If standalone it still needs to be added. This is configured with a listening port.)
2. DistributedMapCacheClientService (Add this controller service to the "Node" if clustered. If standalone it still needs to be added. This is configured with the FQDN of the NCM running the above cache server.)
3. Start the above controller services.
4. Add HashContent and DetectDuplicate processors to your flow between your invokeHTTP processor and the SplitJson processors.
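As a rough sketch, the two controller services end up configured along these lines (the port and hostname are example values to adjust for your environment):

DistributedMapCacheServer (on the NCM / Cluster Manager):
  Port: 4557                          <-- an unused listening port (4557 is the default)

DistributedMapCacheClientService (on each Node):
  Server Hostname: ncm.example.com    <-- FQDN of the instance running the cache server
  Server Port: 4557                   <-- must match the server's listening port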
I have attached a modified version of your template.
eqdataus-detectduplicates.xml If you still see duplicates, adjust the configured Age Off Duration in the DetectDuplicate processor.
Thanks, Matt
08-15-2016
12:40 PM
2 Kudos
@Obaid Salikeen Not sure what "issues" you had when you tried to add a new node to your existing cluster. The components (processors, connections, etc.) of an existing cluster can be running when you add new nodes to it. A new node will inherit the flow and templates from the NCM, as well as the current running state of those components, when it joins.
But, in order for a node to successfully join a cluster the following must be true:
1. The new node either has no flow.xml.gz file and templates directory, or its flow.xml.gz file and templates match what is currently on the NCM. (If they do not match, remove the flow.xml.gz file and templates dir from the new node and restart the node.) The nifi-app.log will indicate if a difference was found.
2. The nifi.sensitive.props.key= in the nifi.properties file must have the same value as on the NCM.
3. The NCM must be able to resolve the URL to the new node. If nifi.web.http(s).host= was left blank on your new node, Java on that node may be reporting the hostname as localhost. Make sure valid resolvable hostnames are supplied for nifi.web.http.host=, nifi.cluster.node.address=, and nifi.cluster.node.unicast.manager.address= (see the sketch below).
4. The security protocol must match on both NCM and node: nifi.cluster.protocol.is.secure= in the nifi.properties file.
5. Firewalls must be open between NCM and node on the HTTP(S) port as well as the node and NCM protocol ports.
6. The new node must have all the same available Java classes. If custom processors exist in your flow, make sure the new node also has those custom nar/jar files included in its lib dir.
Thanks, Matt
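For reference, these are the nifi.properties entries from the checklist above that most often need attention on the new node (the hostnames are examples):

nifi.sensitive.props.key=                                  <-- same value as on the NCM
nifi.web.http.host=node3.example.com
nifi.cluster.node.address=node3.example.com
nifi.cluster.node.unicast.manager.address=ncm.example.com
nifi.cluster.protocol.is.secure=false                      <-- must match the NCM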
08-05-2016
09:20 PM
1 Kudo
@Saikrishna Tarapareddy
Were all 74 files in the input queue before the MergeContent was run?
The MergeContent processor, just like the other processors, works on a run schedule. My guess is that the last file was not in the queue at the moment the MergeContent processor ran, so you only saw 13 get bundled instead of 14. With a minimum of 4 entries, it will read what is on the queue and bin it. You likely ended up with 3 bins of 20 and 1 bin of 13 because at the moment it looked at the queue, only 73 FlowFiles were there, so the last bin saw only 13.
You can confirm this by stopping the MergeContent and allowing all 74 files to queue before starting it. The behavior should then be as you expect.
Sounds like it is not important to have exactly 20 per merged file. Perhaps you can set a max bin age so that files don't get stuck.
Something else you can do is adjust the run schedule so the MergeContent does not run as often. The default is "0 sec", which means run as fast as possible. Try changing that to somewhere between 1 and 10 sec to give the files a chance to queue. If you are picking up all 74 files at the same time, we are likely talking about a difference of milliseconds causing this last file to get missed. Thanks, Matt
08-05-2016
02:48 PM
The attached images do not really show us your complete configuration. Can you generate a template of your flow through the NiFi UI and share that? You create a template by highlighting/selecting all components you want to include in your template and then click on the "create template" icon in the upper center of the UI. After the template has been created you can export it out of your NiFi from the template management UI icon (upper right corner of UI). Then attach that exported xml template here.
08-05-2016
01:50 PM
With a NiFi cluster, every node in that cluster runs the exact same dataflow. Some data-ingest type processors are not ideally suited for this, as they may compete for or pull the same data into each cluster node. In cases like this it is better to set the scheduling strategy on these processors to "On Primary Node" so that the processor only runs on one node (the primary node).
You can then use dataflow design strategies like RPGs (NiFi Site-to-Site) to redistribute the received data across all your NiFi cluster nodes for processing.
08-05-2016
11:57 AM
1 Kudo
@Yogesh Sharma Is your NiFi a cluster or Standalone instance of NiFi? If it is a cluster, it could explain why you are seeing duplicates since the same GetTwitter processor would be running on every Node. Matt
08-03-2016
10:55 AM
5 Kudos
@Ankit Jain
When a NiFi instance is designated as a node, it starts sending out heartbeat messages after it is started. Those heartbeat messages contain important connection information for the node. Part of that message is the hostname of each connecting node. If left blank, Java will try to determine the hostname, and in many cases the hostname ends up being "localhost". This may explain why the same configs worked when all instances were on the same machine.
Make sure that all of the following properties have been set on every one of your Nodes:
# Site to Site properties
nifi.remote.input.socket.host= <-- Set to the FQDN for the Node; must be resolvable by all other instances.
nifi.remote.input.socket.port= <-- Set to unused port on Node.
# web properties #
nifi.web.http.host= <-- set to resolvable FQDN for Node
nifi.web.http.port= <-- Set to unused port on Node
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address= <-- set to resolvable FQDN for Node
nifi.cluster.node.protocol.port= <-- Set to unused port on Node
nifi.cluster.node.protocol.threads=2
# if multicast is not used, nifi.cluster.node.unicast.xxx must have same values as nifi.cluster.manager.xxx #
nifi.cluster.node.unicast.manager.address= <-- Set to the resolvable FQDN of your NCM
nifi.cluster.node.unicast.manager.protocol.port= <-- must be set to Manager protocol port assigned on your NCM.
Your NCM will need to be configured the same way as above for the Site-to-Site properties and web properties, but instead of the "cluster node properties", you will need to fill out the "cluster manager properties":
# cluster manager properties (only configure for cluster manager) #
nifi.cluster.is.manager=true
nifi.cluster.manager.address= <-- set to resolvable FQDN for NCM
nifi.cluster.manager.protocol.port= <-- Set to unused port on NCM.
The most likely cause of your issue is not having the host/address fields populated or trying to use a port that is already in use on the server.
If setting the above does not resolve your issue, try setting DEBUG for the cluster logging in the logback.xml on one of your nodes and on the NCM to get more details:
<logger name="org.apache.nifi.cluster" level="DEBUG"/>
08-02-2016
05:48 PM
3 Kudos
@Obaid Salikeen Try using \\n (double backslash) or "Shift + Enter" in the expression language editor box to create new lines in your replacement string, as shown by Joe Witt above.
Thanks, Matt
07-22-2016
12:26 PM
@Manikandan Durairaj
Simon is completely correct above; however, I want to add a little to his statement about saving the entire flow.xml.gz file (standalone or NiFi cluster node) or flow.tar file (NiFi cluster NCM).
When you generate templates in NiFi, those dataflows are scrubbed of all encrypted values (passwords). When importing those templates into another NiFi, the user will need to repopulate all the processor and controller service passwords manually.
Saving off the flow.xml.gz or flow.tar file will capture the entire flow exactly as it is, encrypted sensitive passwords and all. NiFi will not start if it cannot decrypt these encrypted sensitive properties contained in the flow.xml. When sensitive properties (passwords) are added they are encrypted using these settings from your nifi.properties file:
# security properties #
nifi.sensitive.props.key=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.provider=BC
In order to drop your entire flow.xml.gz or flow.tar onto another clean NiFi, these values must all match exactly.
Thanks, Matt
07-18-2016
10:39 PM
1 Kudo
@gkeys What are the permissions on both the file(s) you are trying to pick up with the GetFile processor and on the directory the file(s) live in?

-rwxrwxrwx 1 nifi dataflow 24B Jul 18 18:20 testfile
drwxr-xr-- 3 root dataflow 102B Jul 18 18:20 testdata

With the above example permissions, I can reproduce exactly what you are seeing. If "Keep Source File" is set to true, NiFi creates a new FlowFile with the content of the file. If "Keep Source File" is set to false, GetFile yields because it does not have the necessary permissions to delete the file from the directory. This is because the write bit is required on the source directory for the user who is trying to delete the file(s). In my example NiFi is running as user nifi, which can read the files in the root-owned testdata directory because the directory group ownership is dataflow (the same group as the nifi user) and the dir has r-x group permissions. If I change that dir's group permissions to rwx, then the nifi user will also be able to delete the testfile (see the example below). Thanks,
Matt
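Continuing the example directory above, adding the group write bit is all that is needed:

chmod g+w testdata    # dir becomes drwxrwxr--; the dataflow group can now delete files in it
ls -ld testdata       # verify the new permissions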
07-18-2016
10:09 PM
1 Kudo
You could also modify the local /etc/hosts file on your ec2 instances so that the hostname "ip-10-40-197.ec2.internal" resolves to the proper external IP addresses for those ZooKeeper nodes, if they have them.
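For example, an entry like the following on each ec2 instance (the IP is a placeholder for the ZooKeeper node's actual external IP):

203.0.113.10   ip-10-40-197.ec2.internal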
07-18-2016
02:52 PM
2 Kudos
NiFi secure cluster and Site-to-Site authentication is not handled by Kerberos. NiFi Kerberos authentication is only supported for user authentication. Secure NiFi Site-to-Site communications are still handled using TLS mutual authentication.
The error you are seeing is because that TLS mutual auth is failing. The URL you are providing the Remote Process Group (RPG) is using the IP of the target NCM. The NCM is providing its public key to your nodes for authentication, and that certificate does not contain the IP as its DN or as a Subject Alternative Name (SAN). So the source NiFi is saying that the provided certificate should contain 10.110.20.213, but instead it is providing something else.
If you do a verbose listing of the keystore on your NCM, you will see the contents of the key. Look for CN=<some value> (this value is typically the hostname/FQDN). Use that value in the URL you are providing your RPG. Make sure your source NiFi (in your case, every node in your NiFi cluster) can resolve that hostname to its proper IP.
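A verbose listing looks something like this (the keystore path and password are placeholders for your own values):

keytool -list -v -keystore /path/to/keystore.jks -storepass <password>

The DN appears on the "Owner:" line of the output (for example, Owner: CN=ncm.example.com, OU=...), and any SANs appear under the certificate extensions.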
The other option is to get a new certificate that has the IP added to it as a SAN.
Thanks, Matt
07-13-2016
12:33 PM
1 Kudo
I recommend setting up a NiFi cluster that will spread the load across multiple resources. This removes the single point of failure caused by only having one ec2 instance running NiFi. Now, whether a single ec2 instance with NiFi can run your dataflows really depends a lot on your data and what your specific dataflows look like. For example, are you doing a lot of CPU- or memory-intensive processing in your NiFi dataflows? A good approach is having NiFi sitting on edge systems feeding a central NiFi processing cluster.
07-08-2016
12:07 PM
1 Kudo
@mliem
NiFi components (processors, RPGs, input/output ports, etc.) are designed to run asynchronously. There is no mechanism built into NiFi for triggering one processor to run as a result of another processor completing its job.
That being said, everything you can do via the UI can also be done through calls directly to the NiFi API. You may consider playing around with using the invokeHTTP processor to make calls to the NiFi API to start and stop specific processors at specific points in your dataflow. Once a processor is started it will run, retrieving a thread from the controller to do so. Stopping that processor will not kill that thread; the processor will simply not be scheduled to run again and will be in a state of "stopping" during that time frame. You cannot start a processor that is still "stopping", so you want to be careful where you invoke your start and stop actions. (For example, following your "matched" criteria you start the mergeContent, and after the mergeContent you invoke the stop of the mergeContent.)
For speed and efficiency's sake, I would look for ways to keep your flow asynchronous in design. If you do choose to go this route, I would also build some monitoring into your flow using the monitorActivity processor. This processor can be used to monitor that data continues to flow based upon a configured threshold. If that threshold is exceeded, it generates a FlowFile that can be routed to a putEmail processor (as an example) to alert someone that the dataflow is down. This is a safety net, so to speak, in the event one of your API calls fails for some reason (a network hiccup, for example). Thanks, Matt
07-07-2016
04:28 PM
It may be helpful to understand your dataflow better if you can paste a screenshot of the second dataflow you want to alter.
07-01-2016
12:06 PM
NiFi 1.0 is deep into development right now. Expect to see it up for vote in August. NiFi 1.0 has had considerable re-work done across the board (new UI, no more NCM for clustering, etc.). Very exciting stuff.
06-30-2016
03:14 PM
6 Kudos
@Alexander Aolaritei NiFi can produce a lot of provenance data. The solution you are looking for will be coming in Apache NiFi 1.0 in the form of a NiFi reporting task. This "SiteToSiteProvenanceReportingTask" will use the NiFi Site-to-Site (S2S) protocol to send provenance events to another NiFi instance in configurable batches. Of course, that target NiFi instance could be itself; however, that would just produce even more provenance events locally as you handle those messages, so it may be wise to stand up another NiFi instance just for provenance event handling. Upon receiving those provenance events via an S2S input port, you can use standard NiFi processors to split/merge them, route them, and store them in your desired end point (whether that is local file(s), an external DB, etc.). I am not a developer so I cannot help with the custom solution you are working on, but I just want to share what is coming as another viable solution to your needs. Thanks, Matt
06-28-2016
08:27 PM
1 Kudo
@AnjiReddy Anumolu Let me start off by making sure I fully understand the dataflow you have created, to better answer your question. You have added a getFile processor to your flow, which will pick up file(s) from a local file system directory and then send them via the success relationship to a logAttribute processor. What did you do with the logAttribute's success relationship? If it is auto-terminated, you are essentially telling NiFi you are done with the files following a successful logging of the files' FlowFile attributes/metadata. If the success relationship has not been defined, the processor will remain invalid and cannot be run. In this case the file(s) picked up by the getFile processor will remain queued on the connection between the getFile processor and the logAttribute processor.

In either case, when NiFi ingests file(s) they are placed in the NiFi content repository. The location of the content repository is defined/configured in the nifi.properties file. The default places it in a directory created within the default NiFi installation directory:

nifi.content.repository.directory.default=./content_repository

NiFi stores file(s) in what are known as claims to make the most efficient use of the system's hard disks. A claim can contain 1 to many files. The default claim configuration is also defined/configured in the nifi.properties file as follows:

nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100

Files smaller than 10 MB may be stored with other files, with up to 100 total files in a single claim. If a file is larger than 10 MB it will end up in a claim of one. At the same time files are written to a claim, FlowFile attributes/metadata are written about the ingested files in the FlowFile repository. The location of the FlowFile repository is also defined/configured in the nifi.properties file:

nifi.flowfile.repository.directory=./flowfile_repository

These FlowFile attributes/metadata contain information such as filename, filesize, location of the claim in the content repository, claim offset, etc. The claim offset is the starting byte location of a particular file's content within a claim. The filesize defines the number of bytes from that offset that make up the complete data. The nifi-app.log contains fairly robust logging by default (configured in the logback.xml file). When NiFi ingests files, NiFi will log that, and the log line will contain information about the claim (location and offset). When NiFi auto-terminates FlowFiles they are removed from the content repository. Depending on the content repository archive setup, the file(s) may be archived for a period of time. Archived file(s) can be replayed using the NiFi provenance UI. Thanks, Matt
06-23-2016
09:41 PM
Was your VM or your NiFi restarted since HDP was installed?
06-08-2016
12:19 PM
Glad I could help and good to hear you are now up and running.
06-03-2016
05:33 PM
You can edit files as root; editing files does not change ownership. You just need to make sure that at the end of editing, all files are owned by the user who will be running your NiFi instances.
Give yourself a fresh start and delete the flow.tar on your NCM and the flow.xml.gz and templates dir on your Node. So at the end of configuring your two NiFi installs (one configured to be the NCM and one separate install configured to be a Node), you started your NCM successfully? Looking in the nifi-app.log for your NCM, do you see the following lines?

2016-06-03 ... INFO [main] org.apache.nifi.web.server.JettyServer NiFi has started. The UI is available at the following URLs:
2016-06-03 ... INFO [main] org.apache.nifi.web.server.JettyServer https://Bxxxxx.xxxxxx.com:8080/nifi

You then go to your other NiFi installation, configured as your Node, and start it.
After it has started successfully, it will start attempting to send heartbeats to Bxxxxx.xxxxxxx.com on port 1xxx. You should see these incoming heartbeats logged in the nifi-app.log on your NCM. Do you see these?

INFO [Process NCM Request-1] o.a.n.c.p.impl.SocketProtocolListener Received request 411684b2-25cb-461f-978e-fb3bda6a7ef0 from Axxxxx.xxxxxx.com
INFO [Process NCM Request-1] o.a.n.c.manager.impl.WebClusterManager Node Event: (......) 'Connection requested from new node. Setting status to connecting.'

After that, the NCM will either mark the node as connected or give a reason for not allowing it to connect.
If you are not seeing these heartbeats in the NCM nifi-app.log, then something is blocking the TCP traffic on the specified port. I did notice in the above example you provided 1xxx as your cluster manager port. Is that port above 1024? Ports <= 1024 are reserved and can't be used by non-root users. If you are running your NCM as a user other than root (as it sounds from the above), NiFi will fail to bind to that port for listening for these heartbeats. Matt
06-03-2016
04:13 PM
1 Kudo
A fresh install of NiFi has no flow.xml.gz file until after it is started for the first time.
Are these fresh NiFi installs, or installations that were previously run standalone? If the latter, you can't simply tell them they are Nodes and NCMs and expect it to work. Your NCM does not run with a flow.xml.gz like your Nodes and standalone instances do. The NCM uses a flow.tar file. The flow.tar would be created on startup and contain an empty flow.xml. When you started your Node (with an existing flow.xml.gz file) it would have communicated with the NCM but been rejected, because the flow on the Node would not have matched what was on the NCM. If you are looking to migrate from a standalone instance to a cluster, I would suggest reading this:
https://community.hortonworks.com/content/kbentry/9203/how-to-migrate-a-standalone-nifi-into-a-nifi-clust.html
Let me make sure I understand your environment:
1. You have two different installations of NiFi.
2. One installation of NiFi is set up and configured to be a non-secure (http) NCM.
3. One installation of NiFi is set up and configured to be a non-secure (http) Node.
4. The # cluster common properties (cluster manager and nodes must have same values) # section in the nifi.properties files on both NCM and Node(s) is configured identically.
5. In that section on both, nifi.cluster.protocol.is.secure=false is configured (cannot be true if running http).
6. The # cluster node properties (only configure for cluster nodes) # section has been configured only on your Node, with the following properties set: nifi.cluster.is.node=true, nifi.cluster.node.unicast.manager.address=, and nifi.cluster.node.unicast.manager.protocol.port= (the port matching what you configured in the next section on your NCM).
7. The # cluster manager properties (only configure for cluster manager) # section has been configured on your NCM only, with nifi.cluster.is.manager=true.
Thanks, Matt
06-03-2016
03:38 PM
Are these https or http configured cluster NCM and Node(s)?
The NCM needs to be able to communicate with the http(s) port and node protocol port configured in the nifi.properties file on the Node(s).
Node needs to be able to communicate with the cluster manager protocol port configured in the nifi.properties file on the NCM.
Thanks, Matt
06-02-2016
01:18 PM
1 Kudo
There are a few things you can do here, if I am understanding correctly what you are trying to accomplish. 1. The logback.xml can be modified so specific processor component logs are redirected to a new dedicated log file (see the sketch after this list). You can specify where that new log is written. You can also specify the log level for those components (WARN level would get you just WARN and ERROR messages).
2. In your dataflow you could use the TailFile processor to monitor that new log and route any generated FlowFiles to a putEmail processor to send them to your Admin. In addition to email, you can route those FlowFiles to a processor of your choice to write a copy to a specific location as well, either locally or remotely. Thanks, Matt
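As a rough sketch of the logback.xml change from point 1, assuming PutFile as the example processor class and a hypothetical log file name:

<appender name="PROCESSOR_LOG" class="ch.qos.logback.core.FileAppender">
  <file>logs/nifi-processor-warnings.log</file>
  <encoder>
    <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
  </encoder>
</appender>
<logger name="org.apache.nifi.processors.standard.PutFile" level="WARN" additivity="false">
  <appender-ref ref="PROCESSOR_LOG"/>
</logger>

Setting additivity="false" keeps those messages out of the main nifi-app.log so the new file is the single place to tail.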
05-31-2016
02:58 PM
Ahmad, the line you are seeing in the nifi-bootstrap.log indicates the JVM started successfully. You need to check the nifi-app.log to make sure the application itself loaded successfully. In the nifi-app.log you will find the following lines if the application loaded successfully:
2016-05-31 10:46:44,347 INFO [main] org.apache.nifi.web.server.JettyServer NiFi has started. The UI is available at the following URLs:
2016-05-31 10:46:44,347 INFO [main] org.apache.nifi.web.server.JettyServer http://<someaddress or FQDN>:8088/nifi

Verify that the hostname or IP displayed on this line is reachable/resolvable on the system you are running your web browser from.
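A quick way to pull those lines out (assuming the default install layout, where the logs live under the logs directory of the NiFi install):

grep "JettyServer" logs/nifi-app.log | tail -5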
Thanks, Matt