Which content repository configuration is better?
One 8 TB disk array configured as RAID 1 with 1 mount point,
or four 2 TB disks configured as RAID 1 with 4 mount points?
Does the number of mount points make a difference in how NiFi uses the content repository?
The second option will offer the best performance. I'd suggest having a look at the following documentation on best practices (still a draft at the moment): https://github.com/JPercivall/nifi/blob/NIFI-1028/nifi-docs/src/main/asciidoc/nifi-in-depth.adoc
For the Provenance and Content repos, there is the option to stripe the information across multiple physical partitions. An admin would do this if they wanted to federate reads and writes across multiple disks. The repo (Content or Provenance) is still one logical store but writes will be striped across multiple volumes/partitions automatically by the system. The directories are specified in the nifi.properties file.
I think I was not clear in my original question. I am not sure whether the two configurations are equivalent.
Option 1: 8 1 TB disks configured as a RAID 1 array with 1 mount point
Option 2: 8 1 TB disks configured as RAID 1 with 4 mount points
Option 1 gives better utilization of disk space and ensures that all my disks are used equally.
Option 2 may cause one mount point to fill up while the others still have space remaining.
@jpercivall Joe will know for sure.
My understanding is that using multiple disks will offer better performance for I/O operations. The system will handle the distribution of the content over the disks and will treat the set of disks as one single volume.
You are correct, Pierre: option 2 is better because of the I/O benefits it offers. Note that in order to configure the content repo to use multiple partitions, you must list them explicitly in the nifi.properties file. You can find information explaining this in the "Content Repository" subsection of the "System Properties" section in the Admin Guide. Look for the properties that start with "nifi.content.repository.directory".
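As a rough illustration of what that could look like for four mount points, here is a sketch of the relevant part of nifi.properties. The paths (/mnt/content1 and so on) are hypothetical, and the suffixes after "nifi.content.repository.directory." are arbitrary unique names chosen for this example; check the Admin Guide for your NiFi version before relying on it.

```properties
# Sketch only: one content repository directory per mount point.
# The mount paths below are hypothetical placeholders.
nifi.content.repository.directory.content1=/mnt/content1/content_repository
nifi.content.repository.directory.content2=/mnt/content2/content_repository
nifi.content.repository.directory.content3=/mnt/content3/content_repository
nifi.content.repository.directory.content4=/mnt/content4/content_repository
```

Each entry adds one directory to the same logical content repository, so the flow itself does not need to know how many partitions sit behind it.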
The key to good disk performance is to separate the repositories across multiple spindles, so that the content_repository, flowfile_repository and provenance_repository are on physically separate disks. This is because every flow writes to all three of them. Your best bet in your configuration is to reserve 2 RAID 1 mounts for the content_repository, so that NiFi will evenly balance content over these drives, and then to have one RAID 1 mount for the flowfile_repository and one for the provenance_repository. This may not give you the most efficient use of space, but space is cheap, and it will give you much better performance than trying to push everything onto a single mount.
@Simon Elliston Ball Thanks Simon. I already have separate mounts for flowfile, provenance, OS and logs. I am just trying to understand whether it is better to have one mount point with all disks for the content repository, or multiple mount points across the physical disks.
According to a simple test I ran, it does a round robin, writing one content claim to each directory. So, as long as your data comes in randomly, it should be pretty well distributed. If every fourth data input is larger than others, you may see one volume used more than others.
My test was simple. Four directories (actually on the same volume) for the content repos. Then I used GenerateFlowFile, 30MB each. It was clear that it was doing round robin as each directory grew.
There is a chance that it is smarter and also considers disk space on each volume, but I was unable to test that.
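To make the round-robin observation concrete, here is a small Python sketch that simulates it. It is not NiFi code: it only assumes a strict round robin with one claim per directory in turn (which is what the test above suggested), and the directory names and claim sizes are made up for illustration.

```python
from itertools import cycle


def simulate_round_robin(claim_sizes_mb, directories):
    """Assign each claim to the next directory in turn and total the MB per directory."""
    usage = {d: 0 for d in directories}
    # cycle() repeats the directory list; zip() stops when the claims run out
    for directory, size in zip(cycle(directories), claim_sizes_mb):
        usage[directory] += size
    return usage


# Hypothetical directory names standing in for four mount points
directories = ["content1", "content2", "content3", "content4"]

# Uniform 30 MB claims, similar to the GenerateFlowFile test
uniform = simulate_round_robin([30] * 400, directories)

# Worst case: every fourth claim is ten times larger than the rest
skewed_sizes = [300 if i % 4 == 3 else 30 for i in range(400)]
skewed = simulate_round_robin(skewed_sizes, directories)

print("uniform:", uniform)
print("skewed: ", skewed)
```

With uniform claims every directory ends up with the same total. In the skewed case all the large claims land in the same directory, which is the situation where one mount point could fill up before the others.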