About MattWho

MattWho · ‎09-06-2016

@INDRANIL ROY The output from the SplitText and RouteText processors is a bunch of FlowFiles all with the same filename (filename of the original FlowFile they were derived from.) NiFi differentiates these FlowFiles by assigning each a Unique Identifier (UUID). The problem you have is then writing to HDFS only the first FlowFile written with a particular filename is successful. All others result in the error you are seeing. The MergeContent processor you added reduces the impact but does not solve your problem. Remember that nodes do not talk to one another or share files with one another. So each MergeContent is working on its own set of files all derived from the same original source file and each node is producing its own merged file with the same filename. The first node to successfully write its file HDFS wins and the other nodes throw the error you are seeing. What is typically done here is to add an UpdateAttribute processor after each of your MergeContent processor to force a unique name on each of the FlowFiles before writing to HDFS. The uuid that NiFi assigns to each of these FlowFiles is often prepended or appended to the filename to solve this problem: If you do not want to merge the FlowFiles, you can simply just add the UpdateAttribute processor in its place. YOu will just end up with a larger number of files written to HDFS. Thanks, Matt

MattWho · ‎09-06-2016

@INDRANIL ROY Your approach above looks good except you really want to split that large 50,000,000 line file in to many more smaller files. Your example shows you only splitting it in to 10 files which may not ensure good file distribution to the downstream NiFi cluster nodes. The RPG load balances batches of files (up to 100 at a time) for speed and efficiency purposes. With so few files it is likely that every file will still end up on the same downstream node instead of load balanced. However if you were to split the source file in to ~5,000 files, you would achieve much better load-balancing. Thanks, Matt

MattWho · ‎09-06-2016

Thank you for the clarification on my post.

MattWho · ‎09-06-2016

@INDRANIL ROY You have a couple things going on here that are affecting your performance. Based on previous HCC discussions you have a single 50,000,000 line file you are splitting in to 10 files (Each 5,000,000 lines) and then distributing those splits to your NiFi cluster via a RPG (Site-to-Site). You are then using the RouteText processor to read every line of these 5,000,000 line files and route the lines based on two conditions. 1. Most NiFi processors (including RouteText) are multi-thread capable by adding additional concurrent tasks. A single concurrent task can work on a single file or batch of files. Multiple threads will not work on the same file. So by setting your current tasks to 10 on the RouteText you may not actually be using 10. The NiFi controller also has a max number of threads configuration that limits the number of threads available across all components. The max thread setting can be found by clicking on this icon in the upper right corner of the UI. Most components by default use timer driven threads, so this is the number you will want to increase in most cases. Now keep in mind that your hardware also limits how much "work" you can do concurrently. With only 4 cores, you are fairly limited. You may want to up this value from the default 10 to perhaps 20. You can just end up with a lot of threads in cpu wait. Avoid getting carried away on your thread allocations (Both at the controller level and processor level). 2. In oder to get better multi-thread throughput on your RouteText processor, try splitting your incoming fie in to many smaller files. Try splitting your 50,000,000 line file in to files with no more then 10,000 lines each. The resulting 5,000 files will be better distributed across your NiFi cluster Nodes and allow the multiple threads to be utilized. Thanks, Matt

MattWho · ‎09-06-2016

@Bojan Kostic It is not currently possible to add new jars /nars to a running NiFi. A restart is always required to get these newly added items loaded. Upon NiFi startup all the jars/nars are unpacked in to the NiFi work directory. To maintain high availability it is recommended that you use a NiFi cluster. This will allow you to do rolling restarts so that your entire cluster is not down at the same time. If adding new components as part of this rolling update, you will not be able to use those new components until all nodes have been updated. Thanks, Matt

MattWho · ‎09-06-2016

@David DN Before Site-to-Site (S2S) can be used the following properties must be set in the nifi.properties file on all the Nodes in your NiFi cluster: # Site to Site properties nifi.remote.input.host=<FQDN of Host> <-- Set to resolveable FQDN by all Nodes nifi.remote.input.secure=false <-- Set to True on if NiFi is running HTTPS nifi.remote.input.socket.port=<Port used for S2S) <-- Needs to be set to support Raw/enable S2S nifi.remote.input.http.enabled=true <-- Set if you want to support HTTP transport nifi.remote.input.http.transaction.ttl=30 sec A restart of your NiFi instances will be necessary for this change to take affect. Matt

MattWho · ‎09-02-2016

@INDRANIL ROY Please share how you have your SplitText and RouteText processors configuration. If understand your end goal, you want to take this single files with 10,000,000 entries/lines and route only lines meeting criteria 1 to one putHDFS while route all other lines to another putHDFS? Thanks, Matt

MattWho · ‎08-31-2016

You can also save portions or all of you dataflow in a to NiFi templates that can be exported for use on other NiFi installations. To create a template simply highlight all the components you want in your template (If you highlight a process group, all components within that process group will be added to the template). Then click on the "create template" icon in the upper middle create your template. The Templates manager UI can be used to export and import these templates from your NiFi. It can be access via this icon in the upper right corner of the NiFi UI. *** Note: NiFi templates are sanitized of any sensitive properties values (A sensitive property value would be any value that would be encrypted. In NiFi that would be any passwords) Matt

MattWho · ‎08-31-2016

@Sami Ahmad Every change you make to the NiFi canvas is immediately saved to the flow.xml.gz file. No need to manually initiate a save. Each installation of NiFi provides you with a single UI for building a dataflow. You can build as many different dataflow as you want on this canvas. These different dataflows do not need to be connected in any way. The most common approach to what you are doing is to create a different process group for each of your unique dataflows. To add a new process group to the canvas, drag the process group icon on to the canvas and give it unique name that identifies the dataflow it contains. If you double click on that process group, you will enter it giving you a blank canvas to work with. So here you can see I have two process groups that are not connected in any way. One contains a dataflow that consists of 6 processors while the other has 89. I can right click on either of these process groups and select either start or stop from the context menu. That start or stop action is applied against every processor within that process group. so this gives you an easy way to stop one dataflow and start another. You could even have both running at the same time. Matt

MattWho · ‎08-31-2016

@Sami Ahmad An easy way to return NiFi to a blank canvas, is to simply stop NiFi and remove the flow.xml.gz file from the NiFi's conf directory. When you restart your NiFi a new blank flow.xml.gz file will be generated. Any FlowFiles that had existed in the deleted flow will be purged from NiFi when it is started. Alternatively: The error you are seeing is occurring because you are inside a NiFi process group and trying to delete all the components; however, NiFi has detected that there are connections attached to that process group from the process group one level up. NiFi will not allow those components to be removed until there feeding connections are removed first. If you return to the root/top level of your NiFi dataflow you can select the connection entering and existing a process group and delete them. Once they have been deleted, you can select the process group itself and delete it. This will in turn delete all components inside that process group. The deletion of connections will only be allowed if there is no queued FlowFiles in that connection. If there is queued FlowFiles, the FlowFiles must be purged before the connection can be deleted. Matt

Member Since	‎07-30-2019 10:41 AM
Last Visited
Posts	3,145
Kudos received	1562

Cloudera Community

Re: Cloudera NiFi - Automatic policy creation

Re: MergeRecord generates multiple files

Re: Flowfile stuck in Wait in EnforceOrder process...

Re: Untrusted proxy error Authentication Failed o....

Re: REST API Configuration for NiFi 2.0

Re: Load balancing while the fetching of file fro...

Re: Load balancing while the fetching of file fro...

Re: NIFI RouteText processor taking too long

Re: NIFI RouteText processor taking too long

Re: Hot swapping the jar in Nifi

Re: 'Remote instance of NiFi is not configured to ...

Re: Load balancing while the fetching of file fro...

Re: multiple Nifi dataflows together

Re: multiple Nifi dataflows together

Re: how to clean a NIFI screen