10-27-2016
12:45 PM
2 Kudos
@Joshua Adeleke First, let's make sure we are on the same page terminology-wise. NiFi FlowFiles are made up of two parts: FlowFile content and FlowFile attributes. FlowFile content is written to NiFi's content repository, while FlowFile attributes live mostly in JVM heap memory and in the NiFi FlowFile repository. It is the FlowFile attributes that move from processor to processor in your dataflow.

Apache NiFi does not have a version 1.0.0.2.0.0.0-579; that is an HDF build of Apache NiFi 1.0.0. If you are migrating to HDF 2.0, I suggest instead migrating to HDF 2.0.1. It has many bug fixes you will want: https://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.0.1/index.html

When you say you want to move the "dataflow process" to another server, are you talking about your entire canvas and all configured processors only? Or are you also talking about moving any active FlowFiles from the old flow to the new?

My suggestion would be to stand up your new NiFi instance/cluster. You can copy your flow.xml.gz file from your old NiFi to the new NiFi. ***NOTE: You need to make sure that the same sensitive props key is used (the config setting for this is found in the nifi.properties file), or your new NiFi will not start because it will not be able to decrypt any of your sensitive properties in that file.

If your old NiFi was secured, you can use its authorized-users.xml file to establish the initial admin authorities in the new NiFi (configure this in the authorizers.xml file).

Once started, you will need to access this new NiFi's UI and address any "invalid" processors/controller services, and add any controller services you may have had running only on the NCM in the old version (there is no NCM in HDF 2.x versions). Some of these may be invalid because of changes to processor/controller-service properties. Once all of that has been addressed, you are ready to move to the next step.

Shut down all ingest processors on your old NiFi and allow it to finish processing out any data it is already working on. At the same time, you can start your new NiFi so it starts ingesting any new data and begins processing it.
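As a minimal sketch of the sensitive props key note above (nifi.sensitive.props.key is the standard property name; the value shown is a placeholder):

```
# nifi.properties -- must hold the SAME value on the old and new instance.
# This key encrypts sensitive component properties inside flow.xml.gz;
# if it differs, the new NiFi cannot decrypt the flow and will not start.
nifi.sensitive.props.key=<same key as the old instance>
```

Thanks, Matt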
10-26-2016
07:33 PM
@Saikrishna Tarapareddy You can actually cut the ExtractText processor out of this flow. I forgot that the RouteText processor generates a "RouteText.Group" FlowFile attribute. You can just use that attribute as the "Correlation Attribute Name" in the MergeContent processor.
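A minimal sketch of the simplified MergeContent configuration (only the relevant property is shown; everything else is assumed to keep its defaults):

```
# MergeContent
Correlation Attribute Name = RouteText.Group   # lines from the same RouteText group merge together
```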
10-26-2016
07:10 PM
1 Kudo
@Saikrishna Tarapareddy I agree that you may still need to split your very large incoming FlowFile into smaller FlowFiles to better manage heap memory usage, but you should be able to use RouteText and ExtractText as follows to accomplish what you want:
RouteText configured as follows: [configuration screenshot not preserved] All grouped lines will be routed to the relationship "TagName" as a new FlowFile. They feed into an ExtractText configured as follows: [configuration screenshot not preserved] This will extract the TagName as an attribute on the FlowFile, which you can then use as the "Correlation Attribute Name" in the MergeContent processor that follows.
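Since the original screenshots were not preserved, here is a hypothetical configuration in the same spirit (the regexes assume the tag name is the first comma-separated field of each line; adjust them to your actual data):

```
# RouteText -- group lines by tag name
Routing Strategy            = Route to each matching Property Name
Matching Strategy           = Satisfies Expression
Grouping Regular Expression = ^([^,]+),.*
TagName (dynamic property)  = true            # every line matches, so all lines route to "TagName"

# ExtractText -- pull the tag name into a FlowFile attribute
tagname (dynamic property)  = ^([^,]+),.*
```

ExtractText writes the first capture group to the attribute "tagname" (plus indexed variants such as "tagname.1"), so "tagname" would be the correlation attribute in this sketch.

Thanks, Matt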
10-26-2016
05:46 PM
@Zack Riesland There are seven fields; however, the seventh field is optional. So you are correct: both "0 0 18 * * ?" and "0 0 18 * * ? *" are valid. The below is from http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/crontrigger.html

----------------

* ("all values") - used to select all values within a field. For example, "*" in the minute field means "every minute".

? ("no specific value") - useful when you need to specify something in one of the two fields in which the character is allowed, but not the other. For example, if I want my trigger to fire on a particular day of the month (say, the 10th), but don't care what day of the week that happens to be, I would put "10" in the day-of-month field, and "?" in the day-of-week field. See the examples below for clarification.

-----------------

So only fields 4 (day-of-month) and 6 (day-of-week) will accept "?".
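For reference, a field-by-field breakdown of the expression in question (standard Quartz field order):

```
0 0 18 * * ? *
| | |  | | | +-- Year (field 7, optional)
| | |  | | +---- Day-of-week ("?" = no specific value)
| | |  | +------ Month (every month)
| | |  +-------- Day-of-month (every day)
| | +----------- Hours (18 = 6 PM)
| +------------- Minutes (0)
+--------------- Seconds (0)
```

Thanks, Matt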
10-26-2016
05:31 PM
The same holds true for the MergeContent side of this flow. Have one MergeContent merge the first 10,000 FlowFiles, and a second MergeContent merge multiple 10,000-line FlowFiles into even larger merged FlowFiles. This again will help prevent running into OOM (out-of-memory) errors.
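A hypothetical two-stage arrangement ("Minimum/Maximum Number of Entries" are standard MergeContent properties; the counts simply mirror the 10,000 figure above):

```
# MergeContent #1 -- first-stage merge of individual lines
Merge Strategy            = Bin-Packing Algorithm
Minimum Number of Entries = 10000
Maximum Number of Entries = 10000

# MergeContent #2 -- merge the 10,000-line FlowFiles into larger ones
Merge Strategy            = Bin-Packing Algorithm
Minimum Number of Entries = 10     # e.g. 10 x 10,000 = 100,000 lines per final FlowFile
```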
10-26-2016
04:47 PM
1 Kudo
@Saikrishna Tarapareddy You may consider using the RouteText processor to route the individual lines from your source FlowFile to relationships based upon your various tag names, and then use MergeContent processors to merge those lines back into a single FlowFile, as sketched below.
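A hypothetical outline of that flow (a SplitText stage is optional, but helps limit heap usage on very large source files):

```
Source --> (SplitText) --> RouteText --> MergeContent --> downstream
                           routes lines    merges lines back together,
                           by tag name     one FlowFile per tag
```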
10-26-2016
02:17 PM
Is user2@domain.net part of your "Admin NiFi" user group?
Did you grant the "Admin Group" the "modify the data" policy? You can get more output in your nifi-user.log by setting the following logger in your logback.xml file from INFO to DEBUG: <logger name="org.apache.nifi.web.api.config" level="INFO" additivity="false"> No NiFi restart is needed for changes to the logback.xml file to take effect.
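For clarity, the edited logger would look like this (the appender ref shown matches a default NiFi logback.xml; only the level value changes):

```
<!-- logback.xml: raise authorization logging from INFO to DEBUG -->
<logger name="org.apache.nifi.web.api.config" level="DEBUG" additivity="false">
    <appender-ref ref="USER_FILE"/>
</logger>
```

Matt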
10-26-2016
12:59 PM
The Quartz scheduler has seven fields, so the cron would need to be "0 0 18 * * ? *". The seventh field, for year, is optional. Yes, the cron you have there will run at the 18th hour of every day.
10-26-2016
12:58 PM
@Paul Yang What you have here is a very light dataflow, based on the picture shown. The NiFi RPG will send data in batches of up to 100 FlowFiles for efficiency. So if the input queue has fewer than 100 FlowFiles in it when it runs, all of those FlowFiles will be routed to a single node. On the next run, the next batch would go to a different node. Over time, if the dataflow rate is constant, the data should be balanced across your nodes.

If I am understanding what you have here, you are feeding the RPG that feeds an input port, that input port feeds an output port, and then you use various RPGs anywhere in your flow to pull data from that output port. Correct?

The problem with this is that the RPG runs on every node, so when a node connects it will try to pull all the FlowFiles it sees on that connection. Nodes are not aware of how many nodes exist in their cluster and will not say, "I should only pull x amount so the other nodes can pull the same." Each node acts in a vacuum and pulls as much data as fast as it can from the output port.

I would suggest instead having your remote input port (root-level input port) feed its success relationship multiple times into the various sub-process groups owned by your various departments. Not only will this provide better load-balanced delivery of data in the cluster, but it will also improve performance.
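A hypothetical before/after sketch of the two topologies described (the department names are placeholders):

```
# Current: every node's RPG pulls from the output port in a vacuum
RPG --> [Input Port] --> [Output Port] <-- RPG (dept A)
                                       <-- RPG (dept B)

# Suggested: fan the root-level input port out to the department groups directly
RPG --> [root-level Input Port] --> Process Group (dept A)
                                +-> Process Group (dept B)
                                +-> Process Group (dept C)
```

Thanks, Matt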
10-26-2016
12:33 PM
If, after adding the "modify the data" policy, it still does not work, check the nifi-user.log to see which entity it is having permission problems with. Did you set processor-level policies on the processors on each side of this queued connection?