Member since: 07-30-2019
Posts: 3131
Kudos Received: 1564
Solutions: 909
09-14-2016
08:46 PM
1 Kudo
@Saikrishna Tarapareddy The section you are referring to is an example setup for a single server:

CPU: 24 - 48 cores
Memory: 64 - 128 GB
Hard drive configuration: 1 hardware RAID 1 array, plus 2 or more hardware RAID 10 arrays

** Everything between each "--------------------" line below is on a single mounted RAID/disk. A RAID can be broken up into multiple logical volumes if desired; if it is, each "/" entry here represents a different logical volume. Creating logical volumes lets you control how much disk space is reserved for each, which is recommended. For example, you would not want excessive logging to eat up space you want reserved for your flowfile-repo. Logical volumes let you control that by splitting that single RAID into multiple logical volumes of a defined size.

--------------------
RAID 1 array (this could also be a RAID 10) containing all of the following directories/logical volumes:
- /
- /boot
- /home
- /var
- /var/log/nifi-logs   <-- point all your NiFi logs (logback.xml) here
- /opt                 <-- install NiFi here under a sub-directory
- /database-repo       <-- point the NiFi database repository here
- /flowfile-repo       <-- point the NiFi flowfile repository here
--------------------
1st RAID 10 array, logical volume mounted as /cont-repo1:
- /cont-repo1          <-- point the NiFi content repository here
--------------------
2nd RAID 10 array, logical volume mounted as /prov-repo1:
- /prov-repo1          <-- point the NiFi provenance repository here
--------------------
3rd RAID 10 array (recommended), logical volume mounted as /cont-repo2:
- /cont-repo2          <-- point a 2nd NiFi content repository here
--------------------

To set up the above example you would need 14 hard disks: 2 for the RAID 1 and 4 for each of the 3 RAID 10s. You would only need 10 disks if you decided to have only one RAID 10 content repo disk (but it would need to be 2 TB). You could also take a large RAID 10, like the one holding prov-repo1, and split it into multiple logical volumes, giving part of that RAID's disk space to a content repo.

Not sure what you mean by "load 2TB of data for future project"? Are you saying you want NiFi to be able to handle a queue backlog of 2 TB of data? If that is the case, each of your cont-repo RAID 10s would need to be at least 1 TB in size.

*** While the nifi.properties file ships with a single line each for the content and provenance repo paths, multiple repos can be added by adding new lines to this file as follows:

nifi.content.repository.directory.default=/cont-repo1/content_repository
nifi.content.repository.directory.cont-repo2=/cont-repo2/content_repository
nifi.content.repository.directory.cont-repo3=/cont-repo3/content_repository
etc...
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.directory.prov-repo1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.prov-repo2=/prov-repo2/provenance_repository
etc...

When more than one repo is defined in the nifi.properties file, NiFi will perform file-based striping across them. This allows NiFi to spread the I/O across multiple disks, helping improve overall performance.

Thanks, Matt
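For completeness, a minimal sketch of the other repository properties referenced in the RAID 1 layout above, assuming the example mount points used in this post (the sub-directory names are just illustrative and can be whatever you prefer):

nifi.flowfile.repository.directory=/flowfile-repo/flowfile_repository
nifi.database.directory=/database-repo/database_repository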
09-14-2016
12:55 PM
5 Kudos
@Pravin Battula As a third option, you could also build a flow to create the delay you are looking for. This can be done using the UpdateAttribute and RouteOnAttribute processors. Here is an example that causes a 5 minute delay for all FlowFiles that pass through these two processors: the value returned by the now() function is the current epoch time in milliseconds. To add 5 minutes, we add 300,000 milliseconds to the current time and store that as a new attribute on the FlowFile. We then check that new attribute against the current time in the RouteOnAttribute processor. If the current time is not greater than the delayed time, the FlowFile is routed to unmatched, so the FlowFile stays in this loop for ~5 minutes. You can adjust the run schedule on the RouteOnAttribute processor to the interval at which you want to re-check the file(s). 0 sec is the default, but I would recommend changing that to at least 1 sec. Thanks, Matt
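A rough sketch of how those two processors might be configured, using NiFi Expression Language. The attribute name "delay.until", the property name "delayElapsed", and the 300,000 ms value are just illustrative:

UpdateAttribute (add a dynamic property):
  delay.until = ${now():toNumber():plus(300000)}

RouteOnAttribute (add a dynamic property; with the default "Route to Property name" strategy, FlowFiles that do not yet match go to "unmatched"):
  delayElapsed = ${now():toNumber():gt(${delay.until})}

Loop the "unmatched" relationship back into the RouteOnAttribute processor itself and send the "delayElapsed" relationship on to the rest of the flow.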
09-13-2016
01:34 PM
4 Kudos
@Dung Nguyen *** This topic applies to HDF 2.0 and NiFi 1.0 versions only. It does not apply to HDF 1.x and NiFi 0.x versions. There are multiple permissions which need to be in place in order to perform provenance queries and view the data returned by those queries. 1. Users who want to perform provenance queries will need to be granted permission on the "query provenance" policy. Select "Policies" from the hamburger menu in the upper right corner of the NiFi UI. 2. In order to view the results (the data) returned by the provenance query, both the users and the systems/servers in the NiFi cluster will need "view the data" permissions on the components the query results are returned against. Policies are assigned at the component level by selecting a component and applying policies to it, as illustrated below. *** What is important to note here is that both users and the servers in this NiFi cluster need "view the data" permissions, or no query results will be displayed in the UI. In the above example I applied my policies to the root process group (top level of the canvas). Any components (processors, process groups, etc.) created at this top level will inherit these policies unless explicitly overridden by their own policies. You can restrict what data users and systems can display down to the component/sub-component level if desired. Thanks, Matt
09-12-2016
12:40 PM
3 Kudos
@spdvnz NiFi's Hadoop-based processors already include the Hadoop client libraries, so there is no need to install them outside of NiFi or to install NiFi on the same hardware where Hadoop is running. The various NiFi processors for communicating with Hadoop use the core-site.xml, hdfs-site.xml, and/or hbase-site.xml files as part of their configuration. These files would need to be copied from your Hadoop system(s) to a local directory on each of your NiFi instances for use by these processors. Detailed processor documentation is provided by clicking on "help" in the upper right corner of the NiFi UI. You can also get to the processor documentation by right-clicking on a processor and selecting "usage" from the context menu that is displayed. Thanks, Matt
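As a rough sketch, assuming you copied the site files to a local directory such as /etc/nifi/hadoop (the path is just an example), a processor like PutHDFS or GetHDFS would reference them in its "Hadoop Configuration Resources" property as a comma-separated list:

  Hadoop Configuration Resources = /etc/nifi/hadoop/core-site.xml,/etc/nifi/hadoop/hdfs-site.xml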
09-12-2016
12:15 PM
3 Kudos
@spdvnz Check out this other article:
https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
There is no difference between how a node in a NiFi cluster and how a standalone NiFi should be set up. Both should follow the guidelines outlined in the above article. As of NiFi 1.x and HDF 2.x, a NiFi cluster no longer has a NiFi Cluster Manager (NCM), and therefore all systems would be set up the same. For NiFi 0.x and HDF 1.x versions, the NCM does not process any data and therefore does not need the content, FlowFile, or provenance repos. The NCM also does not require the same CPU horsepower as the nodes. The NCM can, however, have a significant memory requirement depending on the number of attached nodes and the number of processors added to the canvas. This is because all the processor and connection stats are reported to the NCM in heartbeats and stored in memory. Thanks, Matt
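If you do find that an NCM (or any NiFi instance) needs more heap, that is set in conf/bootstrap.conf. A minimal sketch, assuming the stock bootstrap.conf layout; the 8g value is only an example and should be sized to your node count and flow:

  java.arg.2=-Xms8g
  java.arg.3=-Xmx8g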
09-06-2016
04:45 PM
1 Kudo
@INDRANIL ROY The output from the SplitText and RouteText processors is a bunch of FlowFiles all with the same filename (the filename of the original FlowFile they were derived from). NiFi differentiates these FlowFiles by assigning each a unique identifier (UUID). The problem is that when writing to HDFS, only the first FlowFile written with a particular filename is successful; all others result in the error you are seeing. The MergeContent processor you added reduces the impact but does not solve your problem. Remember that nodes do not talk to one another or share files with one another. So each MergeContent is working on its own set of files, all derived from the same original source file, and each node is producing its own merged file with the same filename. The first node to successfully write its file to HDFS wins, and the other nodes throw the error you are seeing. What is typically done here is to add an UpdateAttribute processor after each of your MergeContent processors to force a unique name on each of the FlowFiles before writing to HDFS. The UUID that NiFi assigns to each of these FlowFiles is often prepended or appended to the filename to solve this problem. If you do not want to merge the FlowFiles, you can simply add the UpdateAttribute processor in its place. You will just end up with a larger number of files written to HDFS.
Thanks, Matt
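A minimal sketch of that UpdateAttribute configuration, using NiFi Expression Language; prepending the UUID is just one option, appending works equally well:

  filename = ${uuid}-${filename}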
09-06-2016
03:02 PM
@INDRANIL ROY Your approach above looks good, except you really want to split that large 50,000,000 line file into many more, smaller files. Your example shows you only splitting it into 10 files, which may not ensure good file distribution to the downstream NiFi cluster nodes. The RPG load-balances batches of files (up to 100 at a time) for speed and efficiency purposes. With so few files it is likely that every file will still end up on the same downstream node instead of being load-balanced. However, if you were to split the source file into ~5,000 files, you would achieve much better load balancing. Thanks, Matt
09-06-2016
02:03 PM
1 Kudo
@INDRANIL ROY
You have a couple of things going on here that are affecting your performance. Based on previous HCC discussions, you have a single 50,000,000 line file that you are splitting into 10 files (each 5,000,000 lines) and then distributing those splits to your NiFi cluster via an RPG (Site-to-Site). You are then using the RouteText processor to read every line of these 5,000,000 line files and route the lines based on two conditions.

1. Most NiFi processors (including RouteText) are multi-thread capable by adding additional concurrent tasks. A single concurrent task can work on a single file or batch of files; multiple threads will not work on the same file. So by setting your concurrent tasks to 10 on the RouteText, you may not actually be using 10. The NiFi controller also has a max-threads configuration that limits the number of threads available across all components. The max thread setting is found in the controller settings accessed from the upper right corner of the UI. Most components by default use timer-driven threads, so that is the number you will want to increase in most cases. Keep in mind that your hardware also limits how much "work" you can do concurrently; with only 4 cores, you are fairly limited. You may want to raise this value from the default 10 to perhaps 20, but you can end up with a lot of threads in CPU wait, so avoid getting carried away with your thread allocations (both at the controller level and the processor level).

2. In order to get better multi-threaded throughput on your RouteText processor, try splitting your incoming file into many smaller files. Try splitting your 50,000,000 line file into files with no more than 10,000 lines each. The resulting 5,000 files will be better distributed across your NiFi cluster nodes and allow the multiple threads to be utilized, as sketched below.

Thanks, Matt
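A minimal sketch of the SplitText configuration for point 2; the property names are as they appear on the processor, and 10,000 lines is just the suggested starting point:

  SplitText
    Line Split Count = 10000
    Header Line Count = 0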