Member since: 07-30-2019
Posts: 3123
Kudos Received: 1563
Solutions: 907
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 306 | 12-13-2024 10:58 AM |
| | 375 | 12-05-2024 06:38 AM |
| | 326 | 11-22-2024 05:50 AM |
| | 260 | 11-19-2024 10:30 AM |
| | 251 | 11-14-2024 01:03 PM |
09-15-2016
02:28 PM
@Saikrishna Tarapareddy The purpose of using RAID is to protect against the loss of a disk. If the intent here is to protect against a complete catastrophic loss of the system, there are some things you can do. Keeping a backup of the conf directory will allow you to quickly restore the state of your NiFi dataflow. Restoring the state of your dataflow does not restore any data that may have been active in the system at the time of failure. The NiFi repos contain the following information:

Database repository --> Contains the change history for the graph (keeps a record of all changes made on the canvas). If NiFi is secured, this repo also contains the users db. Loss of either of these has little impact. Loss of configuration history will not impact your dataflow or data. The users db is rebuilt from the authorized-users.xml file (located in the conf dir by default) when NiFi starts.

Provenance repository(s) --> Contains NiFi FlowFile lineage history. Loss of this repo will not affect your dataflow or data. You will simply be unable to perform queries against data that traversed the system prior to the loss.

FlowFile repository --> Loss of this repo will result in loss of data. The FlowFile repo keeps all attributes about content currently in the dataflow, including where to find the actual content in the content repository(s). The information in this repo changes rapidly, so backing it up is not really feasible. RAID offers your best protection here.

Content repository(s) --> Loss of this repo will also result in loss of data and archived data (if configured to archive). The content repository(s) contain the actual content of the data NiFi processes. The data in this repo also changes rapidly as files are processed through the NiFi dataflow(s), so backing up this repo(s) is also not feasible. RAID offers your best protection here as well.

As you can see, recovery from disk failure is possible with RAID; however, a catastrophic loss of the entire system will result in loss of any data that was mid-processing in your dataflows. Your repos could live on externally attached storage (there is likely to be some performance impact because of this); in the event of catastrophic server loss, a new server could then be stood up using the backed-up conf dir and attached to the same external storage. This would help prevent data loss and allow processing to pick up where it left off. Thanks, Matt
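For reference, the repository locations discussed above are controlled by these entries in nifi.properties (the values shown here are the stock defaults, not taken from the original post; point them at whichever mounts/RAID volumes you use):

nifi.database.directory=./database_repository
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository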
09-14-2016
09:41 PM
If later you decide to add new disks, you can simply copy your content repositories to those new disks and update the nifi.properties repo config lines to point at the new locations.
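As a rough sketch, adding a second content repository on a new disk would look something like this in nifi.properties (the /new-disk1 mount point and property suffix are just illustrative):

nifi.content.repository.directory.default=/cont-repo1/content_repository
nifi.content.repository.directory.new-disk1=/new-disk1/content_repository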
09-14-2016
08:46 PM
1 Kudo
@Saikrishna Tarapareddy The section you are referring to is an example setup for a single server:

CPU: 24 - 48 cores
Memory: 64 - 128 GB
Hard Drive configuration: (1 hardware RAID 1 array) (2 or more hardware RAID 10 arrays)

** What falls between each "--------------------" line is on a single mounted RAID/disk. A RAID can be broken up into multiple logical volumes if desired; if it is, each path here represents a different logical volume. By creating logical volumes you can control how much disk space is reserved for each, which is recommended. For example, you would not want excessive logging to eat up space you want reserved for your flowfile-repo. Logical volumes allow you to control that by splitting up a single RAID into multiple logical volumes of a defined size.

--------------------
RAID 1 array (this could also be a RAID 10) containing all of the following directories/logical volumes:
- /
- /boot
- /home
- /var
- /var/log/nifi-logs <-- point all your NiFi logs (logback.xml) here
- /opt <-- install NiFi here under a sub-directory
- /database-repo <-- point the NiFi database repository here
- /flowfile-repo <-- point the NiFi flowfile repository here
--------------------
1st RAID 10 array, logical volume mounted as /cont-repo1
- /cont-repo1 <-- point the NiFi content repository here
--------------------
2nd RAID 10 array, logical volume mounted as /prov-repo1
- /prov-repo1 <-- point the NiFi provenance repository here
--------------------
3rd RAID 10 array (recommended), logical volume mounted as /cont-repo2
- /cont-repo2 <-- point a 2nd NiFi content repository here
--------------------

In order to set up the above example you would need 14 hard disks: (2) for the RAID 1 array and (4) for each of the (3) RAID 10 arrays. * You would only need 10 disks if you decided to have only one RAID 10 content repo array (but it would need to be 2 TB). You could also take a large RAID 10, like the one used for prov-repo1, and split it into multiple logical volumes, giving part of that RAID's disk space to a content repo.

Not sure what you mean by "load 2TB of data for future project"? Are you saying you want NiFi to be able to handle a queue backlog of 2TB of data? If that is the case, each of your cont-repo RAID 10s would need to be at least 1 TB in size.

*** While the nifi.properties file has a single line for the content and provenance repo paths, multiple repos can be added by adding new lines to this file as follows:

nifi.content.repository.directory.default=/cont-repo1/content_repository
nifi.content.repository.directory.cont-repo2=/cont-repo2/content_repository
nifi.content.repository.directory.cont-repo3=/cont-repo3/content_repository
etc...
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.directory.prov-repo1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.prov-repo2=/prov-repo2/provenance_repository
etc...

When more than one repo is defined in the nifi.properties file, NiFi will perform file-based striping across them. This allows NiFi to spread out the I/O across multiple disks, helping improve overall performance. Thanks, Matt
09-14-2016
12:55 PM
5 Kudos
@Pravin Battula As a third option you could also build a flow to create the delay you are looking for. This can be done using the UpdateAttribute and RouteOnAttribute processors. Here is an example that causes a 5 minute delay to all FlowFiles that pass through these two processors:

The value returned by the now() function is the current epoch time in milliseconds. To add 5 minutes we need to add 300,000 milliseconds to the current time and store that as a new attribute on the FlowFile. We then check that new attribute against the current time in the RouteOnAttribute processor. If the current time is not greater than the delayed time, the FlowFile is routed to unmatched. So here the FlowFile would be stuck in this loop for ~5 minutes. You can adjust the run schedule on the RouteOnAttribute processor to the interval at which you want to re-check the file(s). 0 sec is the default, but I would recommend changing that to at least 1 sec. Thanks, Matt
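A sketch of the Expression Language this pattern typically uses (the names delayUntil and delayElapsed are my own choices, not from the original post):

UpdateAttribute - add one custom property to stamp the release time on the FlowFile:
delayUntil = ${now():toNumber():plus(300000)}

RouteOnAttribute - add one routing property and loop the "unmatched" relationship back to this processor:
delayElapsed = ${now():toNumber():gt(${delayUntil})}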
09-13-2016
01:34 PM
4 Kudos
@Dung Nguyen *** This topic applies to HDF 2.0 and NiFi 1.0 versions only. It does not apply to HDF 1.x and NiFi 0.x versions.

There are multiple permissions which need to be in place in order to perform provenance queries and view the data returned by those queries.

1. Users who want to perform provenance queries will need to be granted permission to the "query provenance" policy. Select "Policies" from the hamburger menu in the upper right corner of the NiFi UI:

2. In order to view the results (the data) returned by the provenance query, both the users and the systems/servers in the NiFi cluster will need "view the data" permissions on the components the query results are returned against. Policies are assigned at the component level by selecting a component and applying policies as illustrated below:

*** What is important to note here is that both users and servers in this NiFi cluster need "view the data" permissions or no query results will be displayed in the UI. In the above example I applied my policies to the root process group (top level of the canvas). Any components (processors, process groups, etc...) created on this top layer will inherit these policies unless explicitly overridden by their own policies. You can restrict what data users and systems can display down to the component/sub-component level if desired. Thanks, Matt
09-12-2016
12:40 PM
3 Kudos
@spdvnz NiFi's Hadoop-based processors already include the Hadoop client libraries, so there is no need to install them outside of NiFi or to install NiFi on the same hardware where Hadoop is running. The various NiFi processors for communicating with Hadoop use the core-site.xml, hdfs-site.xml, and/or hbase-site.xml files as part of their configuration. These files would need to be copied from your Hadoop system(s) to a local directory on each of your NiFi instances for use by these processors. Detailed processor documentation is provided by clicking on "help" in the upper right corner within the NiFi UI. You can also get to the processor documentation by right-clicking on a processor and selecting "usage" from the context menu that is displayed. Thanks, Matt
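For example, on the HDFS processors (PutHDFS, GetHDFS, ListHDFS, etc.) the copied files are referenced via the "Hadoop Configuration Resources" property; the /etc/nifi/hadoop-conf directory below is just an illustrative location:

Hadoop Configuration Resources: /etc/nifi/hadoop-conf/core-site.xml,/etc/nifi/hadoop-conf/hdfs-site.xml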
09-12-2016
12:15 PM
3 Kudos
@spdvnz Check out this other article:
https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
There is no difference between how a node in a NiFi cluster and how a standalone NiFi should be set up. Both should follow the guidelines outlined in the above article. As of NiFi 1.x and HDF 2.x, a NiFi cluster no longer has a NiFi Cluster Manager (NCM), and therefore all systems would be set up the same. For NiFi 0.x and HDF 1.x versions, the NCM does not process any data and therefore does not need the content repos, FlowFile repo, or provenance repos. The NCM also does not require the same CPU horsepower as the nodes. The NCM can have a significant memory requirement depending on the number of attached nodes and the number of processors added to the canvas. This is because all the processor and connection stats are reported to the NCM in heartbeats and stored in memory. Thanks, Matt
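As a rough sketch, the only cluster-specific additions on a NiFi 1.x / HDF 2.x node are a handful of nifi.properties entries like the following (hostnames and ports are placeholders); everything else (repos, heap, etc.) is configured just like a standalone instance:

nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181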
09-06-2016
04:45 PM
1 Kudo
@INDRANIL ROY The output from the SplitText and RouteText processors is a bunch of FlowFiles all with the same filename (the filename of the original FlowFile they were derived from). NiFi differentiates these FlowFiles by assigning each a unique identifier (uuid). The problem you have is that when writing to HDFS, only the first FlowFile written with a particular filename is successful; all others result in the error you are seeing. The MergeContent processor you added reduces the impact but does not solve your problem. Remember that nodes do not talk to one another or share files with one another. So each MergeContent is working on its own set of files, all derived from the same original source file, and each node is producing its own merged file with the same filename. The first node to successfully write its file to HDFS wins, and the other nodes throw the error you are seeing. What is typically done here is to add an UpdateAttribute processor after each of your MergeContent processors to force a unique name on each of the FlowFiles before writing to HDFS. The uuid that NiFi assigns to each of these FlowFiles is often prepended or appended to the filename to solve this problem. If you do not want to merge the FlowFiles, you can simply add the UpdateAttribute processor in its place. You will just end up with a larger number of files written to HDFS.
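A sketch of the UpdateAttribute property commonly used for this (filename and uuid are standard core FlowFile attributes):

filename = ${filename}-${uuid}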
Thanks, Matt
09-06-2016
03:02 PM
@INDRANIL ROY Your approach above looks good, except you really want to split that large 50,000,000-line file into many more, smaller files. Your example shows you only splitting it into 10 files, which may not ensure good file distribution to the downstream NiFi cluster nodes. The RPG load-balances batches of files (up to 100 at a time) for speed and efficiency purposes. With so few files it is likely that every file will still end up on the same downstream node instead of being load-balanced. However, if you were to split the source file into ~5,000 files, you would achieve much better load-balancing. Thanks, Matt
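For example, using SplitText's "Line Split Count" property, 50,000,000 lines at 10,000 lines per split yields roughly 5,000 FlowFiles (the specific split size here is just the arithmetic, not from the original post):

Line Split Count: 10000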