Created 09-13-2016 08:11 PM
Hi, we are trying to set up a standalone NiFi server in our Hadoop environment in the cloud and are trying to determine the best configuration for it. We will have one standalone server on-site to do site-to-site with the cloud NiFi.
We don't have many use cases as of now and may get more in the future; based on that, we may move to a clustered environment.
We may have to load 2 TB of data for a future project. Keeping that in mind, I am trying to figure out suitable configurations for our servers in terms of number of cores, RAM, hard drives, etc.
Thanks,
Sai
Created 10-11-2016 03:09 PM
The retention settings in the nifi.properties file are for the NiFi data archive only. They do not apply to files that are active (queued or still being processed) in any of your dataflows. NiFi will allow you to continue to queue data in your dataflow all the way up to the point where your content repository disk is 100% utilized. That is why backpressure on connections throughout your dataflow is important to control the amount of FlowFiles that can be queued. It is also important to isolate the content repository from the other NiFi repositories so that if it fills its disk, it does not cause corruption of those other repositories.
If content repository archiving is enabled
nifi.content.repository.archive.enabled=true
then the retention and usage percentage settings in the nifi.properties file take effect. NiFi will archive FlowFiles once they are auto-terminated at the end of a dataflow. Data active in your dataflow will always take priority over archived data. If your dataflow should queue to the point that your content repository disk is full, the archive will be empty.
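For reference, the retention settings being described live in nifi.properties next to the archive.enabled line above; the values shown here are just the shipped defaults, not a recommendation for any particular disk size:
# archived content is purged once it is older than this, or sooner if disk usage passes the percentage below
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%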
The purpose of archiving data is to allow users to replay data from any point in the dataflow, or to download and examine the content of a FlowFile after it has been processed through a dataflow, via the NiFi provenance UI. For many this is a valuable feature; for others it is not so important. If it is not important for your org to archive any data, you can simply set archive enabled to false.
FlowFiles that are not processed successfully within your dataflow are routed to failure relationships. As long as you do not auto-terminate any of your failure relationships, those FlowFiles remain active/queued in your dataflow. You can then build some failure-handling dataflow if you like, to make sure you do not lose that data.
Matt
Created 09-14-2016 02:26 AM
I am also interested to know whether NiFi is a processor-heavy or memory-heavy tool.
Created 09-14-2016 05:58 AM
The type and size of hardware needed for NiFi really depend on your load. NiFi stores data on disk while processing it, so you need sufficient disk capacity for your content repository and FlowFile repository, as well as your provenance (data lineage) repository. Have you enabled archiving (I am assuming yes)? Then, for how long do you archive your data? You need space for that.
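For a concrete picture of where that disk space goes, these are the repository-related entries in nifi.properties; the paths and retention values shown are just the shipped defaults (relative to the NiFi install directory), and the sizing discussion below is largely about pointing them at separate, appropriately sized mounts:
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository
# provenance history is bounded by whichever of these limits is hit first
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB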
To your question about whether NiFi is memory intensive or processor intensive, the answer is processor. Unless you are doing bulk loads, which I think you should not, you likely want to make sure you have enough processing power. Please see the following link for performance expectations.
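As a data point on the memory side, the JVM heap NiFi ships with is fairly small and is set in conf/bootstrap.conf; the lines below are the defaults, and you would typically only raise them if your flows use memory-heavy processors (large merges, splits of big files, etc.):
# default JVM heap settings in conf/bootstrap.conf
java.arg.2=-Xms512m
java.arg.3=-Xmx512m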
Created 09-14-2016 06:04 PM
Hi @mclark
I just read this article by you. It was very helpful.
In the section about hardware, I am trying to understand the following:
Using a RAID 1 array, will we be able to point different repositories to different hard disk drives?
If so, do we need to specify the size of each disk, since the one we are going to use for the content repo needs to be in TBs, whereas the FlowFile repo and provenance repo don't need to be that large?
Does anything else apart from the repos need to be on separate drives?
In your breakdown here, does each / point to a different drive?
(1 hardware RAID 1 array)
RAID 1 array (This could also be a RAID 10) logical volumes:
Created 09-14-2016 08:46 PM
The section you are referring to is an example setup for a single server:
CPU: 24 - 48 cores
Memory: 64 - 128 GB
Hard Drive configuration:
(1 hardware RAID 1 array)
(2 or more hardware RAID 10 arrays)
** What falls between each "----------------" line is on a single mounted RAID/disk. A RAID can be broken up into multiple logical volumes if desired. If it is, each / here represents a different logical volume. By creating logical volumes you can control how much disk space is reserved for each, which is recommended. For example, you would not want excessive logging to eat up space you want reserved for your flowfile-repo. Logical volumes allow you to control that by splitting up that single RAID into multiple logical volumes of a defined size.
--------------------
RAID 1 array (This could also be a RAID 10) containing all the following directories/logical volumes:
--------------------
1st RAID 10 array logical volumes mounted as /cont-repo1
- /cont-repo1 <-- point 1st NiFi content repository here
---------------------
2nd RAID 10 array logical volumes mounted as /prov-repo1
- /prov-repo1 <-- point NiFi provenance repository here
---------------------
3rd RAID 10 array logical volumes (recommended) mounted as /cont-repo2
- /cont-repo2 <-- point 2nd NiFi content repository here
----------------------
In order to set up the above example you would need 14 hard disks:
(2) RAID 1
(4) RAID 10 (x3) * You would only need 10 disks if you decided to have only one RAID 10 content repo array (but it would need to be 2 TB). You could also take a large RAID 10, like the one holding prov-repo1, and split it into multiple logical volumes, giving part of that RAID's disk space to a content repo.
Not sure what you mean by "load 2 TB of data for a future project"? Are you saying you want NiFi to be able to handle a queue backlog of 2 TB of data? If that is the case, each of your cont-repo RAID 10s would need to be at least 1 TB in size.
*** While the nifi.properties file has a single line for the content and provenance repo paths, multiple repos can be added by adding additional new lines to this file as follows:
nifi.content.repository.directory.default=/cont-repo1/content_repository
nifi.content.repository.directory.cont-repo2=/cont-repo2/content_repository
nifi.content.repository.directory.cont-repo3=/cont-repo3/content_repository
etc...
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.directory.prov-repo1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.prov-repo2=/prov-repo2/provenance_repository
etc....
When more than one repo is defined in the nifi.properties file, NiFi will perform file-based striping across them. This allows NiFi to spread the I/O across multiple disks, helping improve overall performance.
Thanks,
Matt
Created 09-14-2016 09:22 PM
@mclark Thank you Matt. We are trying to setup a standalone server.
Let's say we can't afford RAID 10 disks. What are the better alternate options? Can we go with RAID 1 for all? And in the future, if we decide to go for RAID 10, can we easily change the config files and migrate?
Created 09-14-2016 09:39 PM
RAID 1 is fine.
Created 09-14-2016 09:41 PM
If later you decide to add new disks, you can simply copy your content repositories to those new disks and update the nifi.properties repo config lines to point at the new locations.
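For example (the new mount point here is hypothetical), if you copied the data from /cont-repo1 to a new RAID 10 mounted at /new-cont-repo1 while NiFi was stopped, the only nifi.properties change needed would be the matching directory line:
# before
nifi.content.repository.directory.default=/cont-repo1/content_repository
# after copying the repo contents to the new disk
nifi.content.repository.directory.default=/new-cont-repo1/content_repository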
Created 09-15-2016 01:53 PM
@mclark, thank you. Also, what happens if the server breaks down or we lose the disks because of corruption or failures? How do we get NiFi back to its previous state? Can we persist NiFi's state to any database? I read somewhere that if we keep a backup of the /conf folder, we should be in good shape?
Any help here is much appreciated.
Created 09-15-2016 02:28 PM
The purpose of using a RAID is to protect against the loss of a disk. If the intent here is to protect against a complete catastrophic loss of the system, there are some things you can do.
Keeping a backup of the conf directory will allow you to quickly restore the state of your NiFi's dataflow. Restoring the state of your dataflow does not restore any data that may have been active in the system at the time of failure.
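As a point of reference, the dataflow itself lives in the flow.xml.gz file, whose location is set in nifi.properties; with the shipped defaults it (and its change archives) sits inside the conf directory, which is why backing up conf is enough to recreate the flow on a rebuilt server:
nifi.flow.configuration.file=./conf/flow.xml.gz
nifi.flow.configuration.archive.dir=./conf/archive/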
The NiFi repos contain the following information:
Database repository --> Contains change history for the graph (keeps a record of all changes made on the canvas). If NiFi is secured, this repo also contains the users DB. Loss of either of these has little impact. Loss of configuration history will not impact your dataflow or data. The users DB is rebuilt from the authorized-users.xml file (located in the conf dir by default) upon NiFi start.
Provenance repository(s) --> Contains NiFi FlowFile lineage history. Loss of this repo will not affect your dataflow or data. You will simply be unable to perform queries against data that traversed the system prior to the loss.
FlowFile repository --> Loss of this repo will result in loss of data. The FlowFile repo keeps all attributes about content currently in the dataflow, including where to find the actual content in the content repository(s). The information in this repo changes rapidly, so backing up this repo is not really feasible. RAID offers your best protection here.
Content repository(s) --> Loss of this repo will also result in loss of data and archived data (if configured to archive). The content repository(s) contain the actual content of the data NiFi processes. The data in this repo also changes rapidly as files are processed through the NiFi dataflow(s), so backing up these repos is also not feasible. RAID offers your best protection here as well.
As you can see, recovery from disk failure is possible with RAID; however, a catastrophic loss of the entire system will result in loss of the data that was in mid-processing by any of the dataflows.
Your repos could be on externally attached storage. (There is likely to be some performance impact because of this; however, in the event of catastrophic server loss, a new server could be stood up using the backed-up conf dir and attached to the same external storage. This would help prevent data loss and allow processing to pick up where it left off.)
Thanks,
Matt