Member since: 07-30-2019
Posts: 3126
Kudos Received: 1564
Solutions: 907

My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 337 | 12-13-2024 10:58 AM |
|  | 400 | 12-05-2024 06:38 AM |
|  | 348 | 11-22-2024 05:50 AM |
|  | 270 | 11-19-2024 10:30 AM |
|  | 264 | 11-14-2024 01:03 PM |
09-20-2016
01:45 PM
@Gerd Koenig Can you try changing the value of "Message Delimiter" in your PutKafka processor from the literal string "\n" to an actual new line? You can enter a new line in the property value by holding the Shift key while pressing Enter. Thanks, Matt
09-20-2016
12:35 PM
2 Kudos
@Gerd Koenig The question here is whether you are running Apache NiFi 0.6 or HDF 1.2. I believe you are using Apache NiFi 0.6, which does not understand PLAINTEXTSASL as a security protocol.

The Kafka 0.8 in HDP 2.3.2 and the Kafka 0.9 in HDP 2.3.4 use a custom Hortonworks Kafka client library. Kafka 0.8 in HDP 2.3.2 introduced support for Kerberos before it was supported in the community, and that support introduced the PLAINTEXTSASL security protocol. Later, when Apache Kafka 0.9 added Kerberos support, it used a different security protocol (SASL_PLAINTEXT). In order for HDF 1.2 to work with HDP 2.3.2, the GetKafka processor was modified from the Apache GetKafka to use that modified client library. Hortonworks again modified the client library in HDP 2.3.4 for Kafka 0.9 so that it remained backwards compatible and still supported the PLAINTEXTSASL security protocol.

So the bottom line is that HDF 1.2 NiFi can talk Kerberos to both HDP 2.3.2 (Kafka 0.8) and HDP 2.3.4 (Kafka 0.9), but Apache NiFi cannot. The newer consume and publish Kafka processors available in NiFi 0.7, NiFi 1.0, and HDF 2.0 do not use the Hortonworks custom Kafka client library and can be used with Kafka 0.9 but not Kafka 0.8. You will need to use the SASL_PLAINTEXT security protocol with these new processors. Thanks, Matt
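For reference, a minimal sketch of what a Kerberized setup of the newer ConsumeKafka processor might look like. The property names below follow the NiFi 1.0 / HDF 2.0 ConsumeKafka processor (double-check them against your version's usage docs); the broker address, topic, group, and JAAS file path are illustrative examples only:

Kafka Brokers = broker1.example.com:6667
Security Protocol = SASL_PLAINTEXT
Topic Name(s) = my-topic
Group ID = nifi-consumer-group
Kerberos Service Name = kafka

In addition, the NiFi JVM needs to be pointed at a JAAS configuration containing the Kafka client principal and keytab, for example by adding an unused java.arg entry to conf/bootstrap.conf:

java.arg.15=-Djava.security.auth.login.config=/path/to/kafka-client-jaas.conf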
09-16-2016
07:53 PM
2 Kudos
@David Morris The NiFi Expression Language can be used to route your data based on file extensions as you have described. When NiFi ingests data, a FlowFile is created. That FlowFile is a combination of the original content and metadata (attributes) about that content. Some attributes are created for every FlowFile upon ingest; one of them is named "filename" and contains the original filename of the ingested file.
The RouteOnAttribute processor can use the NiFi Expression Language to evaluate the FlowFile's "filename" attribute for routing purposes.
In the RouteOnAttribute processor you would add a new property for each file extension you want to look for. Each newly added property becomes a new relationship on that processor, which can then be routed to a follow-on processor; a sketch of what those properties might look like is shown below. Thanks, Matt
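For illustration, the added RouteOnAttribute properties could look like the following. The property names (csv, json, xml) are arbitrary examples; each one becomes a relationship of the same name:

csv = ${filename:endsWith('.csv')}
json = ${filename:endsWith('.json')}
xml = ${filename:endsWith('.xml')}

If the extensions may arrive in mixed case, you could normalize first, e.g. ${filename:toLower():endsWith('.csv')}.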
09-15-2016
02:28 PM
@Saikrishna Tarapareddy The purpose of using a RAID is to protect against the loss of a disk. If the intent here is to protect against a complete catastrophic loss of the system, there are some things you can do. Keeping a backup of the conf directory will allow you to quickly restore the state of your NiFi's dataflow. Restoring the state of your dataflow does not restore any data that may have been active in the system at the time of failure. The NiFi repos contain the following information:

Database repository --> Contains change history to the graph (a record of all changes made on the canvas). If NiFi is secured, this repo also contains the users db. Loss of either of these has little impact: loss of configuration history will not affect your dataflow or data, and the users db is rebuilt from the authorized-users.xml file (located in the conf dir by default) when NiFi starts.

Provenance repository(s) --> Contains NiFi FlowFile lineage history. Loss of this repo will not affect your dataflow or data; you will simply be unable to query data that traversed the system prior to the loss.

FlowFile repository --> Loss of this repo will result in loss of data. The FlowFile repo keeps all attributes about content currently in the dataflow, including where to find the actual content in the content repository(s). The information in this repo changes rapidly, so backing it up is not really feasible. RAID offers your best protection here.

Content repository(s) --> Loss of this repo will also result in loss of data and archived data (if configured to archive). The content repository(s) contain the actual content of the data NiFi processes. The data in this repo also changes rapidly as files are processed through the NiFi dataflow(s), so backing it up is also not feasible. RAID offers your best protection here as well.

As you can see, recovery from disk failure is possible with RAID; however, a catastrophic loss of the entire system will result in loss of the data that was in mid-processing by any of the dataflows. Your repos could be on external attached storage. (There is likely to be some performance impact because of this; however, in the event of a catastrophic server loss, a new server could be stood up using the backed-up conf dir and attached to the same external storage. This would help prevent data loss and allow processing to pick up where it left off.) Thanks, Matt
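For reference, the repository locations discussed above are controlled by these nifi.properties entries; the values shown here are just the shipping defaults, which you would point at your RAID-backed mounts:

nifi.database.directory=./database_repository
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository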
09-14-2016
09:41 PM
If you later decide to add new disks, you can simply copy your content repositories to those new disks and update the repo config lines in the nifi.properties file to point at the new locations.
09-14-2016
08:46 PM
1 Kudo
@Saikrishna Tarapareddy The section you are referring to is an example setup for a single server:

CPU: 24 - 48 cores
Memory: 64 - 128 GB
Hard drive configuration: 1 hardware RAID 1 array and 2 or more hardware RAID 10 arrays

** What falls between each "--------------------" line is on a single mounted RAID/disk. A RAID can be broken up into multiple logical volumes if desired; if it is, each path here represents a different logical volume. Creating logical volumes lets you control how much disk space is reserved for each, which is recommended. For example, you would not want excessive logging to eat up space you want reserved for your flowfile-repo. Logical volumes let you control that by splitting a single RAID into multiple logical volumes of a defined size.

--------------------
RAID 1 array (this could also be a RAID 10) containing all of the following directories/logical volumes:
- /
- /boot
- /home
- /var
- /var/log/nifi-logs <-- point all your NiFi logs (logback.xml) here
- /opt <-- install NiFi here under a sub-directory
- /database-repo <-- point NiFi database repository here
- /flowfile-repo <-- point NiFi flowfile repository here
--------------------
1st RAID 10 array, mounted as /cont-repo1:
- /cont-repo1 <-- point NiFi content repository here
--------------------
2nd RAID 10 array, mounted as /prov-repo1:
- /prov-repo1 <-- point NiFi provenance repository here
--------------------
3rd RAID 10 array (recommended), mounted as /cont-repo2:
- /cont-repo2 <-- point 2nd NiFi content repository here
--------------------

In order to set up the above example you would need 14 hard disks: 2 for the RAID 1 array and 4 for each of the three RAID 10 arrays. * You would only need 10 disks if you decided to have only one RAID 10 content repo array (but it would need to be 2 TB). You could also take a large RAID 10, like the one holding prov-repo1, and split it into multiple logical volumes, giving part of that RAID's disk space to a content repo.

Not sure what you mean by "load 2TB of data for future project"? Are you saying you want NiFi to be able to handle a queue backlog of 2 TB of data? If that is the case, each of your cont-repo RAID 10s would need to be at least 1 TB in size.

*** While the nifi.properties file ships with a single line each for the content and provenance repo paths, multiple repos can be added by adding new lines to this file as follows:
nifi.content.repository.directory.default=/cont-repo1/content_repository
nifi.content.repository.directory.cont-repo2=/cont-repo2/content_repository
nifi.content.repository.directory.cont-repo3=/cont-repo3/content_repository
etc...
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.directory.prov-repo1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.prov-repo2=/prov-repo2/provenance_repository
etc...

When more than one repo is defined in the nifi.properties file, NiFi will perform file-based striping across them. This allows NiFi to spread the I/O across multiple disks, helping improve overall performance. Thanks, Matt
09-14-2016
12:55 PM
5 Kudos
@Pravin Battula As a third option you could also build a flow to create the delay you are looking for. This can be done using the UpdateAttribute and RouteOnAttribute processors. Here is an example that applies a 5 minute delay to all FlowFiles that pass through these two processors (a sketch of the expressions follows below). The value returned by the now() function is the current epoch time in milliseconds. To add 5 minutes we add 300,000 milliseconds to the current time and store that as a new attribute on the FlowFile. We then check that new attribute against the current time in the RouteOnAttribute processor. If the current time is not greater than the delayed time, the FlowFile is routed to unmatched, so the FlowFile stays in this loop for ~5 minutes. You can adjust the run schedule on the RouteOnAttribute processor to the interval at which you want to re-check the file(s). 0 sec is the default, but I would recommend changing that to at least 1 sec. Thanks, Matt
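A minimal sketch of the two processor settings (the attribute name delayUntil and the routing property name delayElapsed are arbitrary examples):

UpdateAttribute - add property:
delayUntil = ${now():toNumber():plus(300000)}

RouteOnAttribute - add property:
delayElapsed = ${now():toNumber():gt(${delayUntil})}

Route the "unmatched" relationship back to the RouteOnAttribute processor itself and the "delayElapsed" relationship on to the rest of the flow.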
09-13-2016
01:34 PM
4 Kudos
@Dung Nguyen *** This topic applies to HDF 2.0 and NiFi 1.0 versions only. It does not apply to HDF 1.x and NiFi 0.x versions.

There are multiple permissions which need to be in place in order to perform provenance queries and view the data returned by those queries:

1. Users who want to perform provenance queries will need to be granted permission on the "query provenance" policy. Select "Policies" from the hamburger menu in the upper right corner of the NiFi UI.

2. In order to view the results (the data) returned by a provenance query, both the users and the systems/servers in the NiFi cluster will need "view the data" permission on the components the query results are returned against. Policies are assigned at the component level by selecting a component and applying policies to it.

*** What is important to note here is that both users and servers in this NiFi cluster need "view the data" permissions or no query results will be displayed in the UI. In my example I applied the policies to the root process group (top level of the canvas). Any components (processors, process groups, etc.) created on this top layer will inherit these policies unless explicitly overridden by their own policies. You can restrict what data users and systems can display down to the component/sub-component level if desired. Thanks, Matt
09-12-2016
12:40 PM
3 Kudos
@spdvnz NiFi's Hadoop-based processors already include the Hadoop client libraries, so there is no need to install them outside of NiFi or to install NiFi on the same hardware where Hadoop is running. The various NiFi processors for communicating with Hadoop use the core-site.xml, hdfs-site.xml, and/or hbase-site.xml files as part of their configuration. These files would need to be copied from your Hadoop system(s) to a local directory on each of your NiFi instances for use by these processors. Detailed processor documentation is provided by clicking on "help" in the upper right corner of the NiFi UI. You can also get to the processor documentation by right-clicking on a processor and selecting "usage" from the context menu that is displayed. Thanks, Matt
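As an illustration, an HDFS processor such as PutHDFS would then reference the copied files through its "Hadoop Configuration Resources" property; the directory below is only an example location:

Hadoop Configuration Resources = /etc/nifi/hadoop-conf/core-site.xml,/etc/nifi/hadoop-conf/hdfs-site.xml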