Member since: 07-30-2019
Posts: 3133
Kudos Received: 1564
Solutions: 909
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 141 | 01-09-2025 11:14 AM |
 | 841 | 01-03-2025 05:59 AM |
 | 430 | 12-13-2024 10:58 AM |
 | 470 | 12-05-2024 06:38 AM |
 | 381 | 11-22-2024 05:50 AM |
11-08-2016
08:03 PM
1 Kudo
@Sunile Manjee You are absolutely correct, HDF does not require that all the HDF services are installed. All the services are installed through RPMs. When you run through the Ambari wizard you will be asked to select the services you wish to "Deploy". By default they are all checked except for Log Search, which is currently a Technical Preview (TP). Simply uncheck any services you do not want to install. If you uncheck something that is a dependency of NiFi, the wizard will let you know. Thanks, Matt
11-08-2016
07:31 PM
@Jobin George No, Ambari-based HDF deployments force an external ZK. That ZK is also used by the other component services available in the HDF stack.
11-08-2016
06:17 PM
@Sunile Manjee As far as best practices go, we do not recommend installing ZK on the same servers/nodes as NiFi. NiFi dataflows can be very CPU, disk, and/or memory intensive. Any of these can interfere with ZK --> NiFi communications and performance. This can result in NiFi nodes dropping from the cluster, new NiFi cluster coordinators being assigned, and/or new primary nodes being elected frequently. While it does work, I would keep away from co-location in production for sure. Thanks, Matt
11-08-2016
04:39 PM
1 Kudo
@Sunile Manjee
There is no reason you can't use another ZK (including the one provided in HDP). While there is currently no support for installing NiFi within an HDP Ambari stack, you can point your NiFi installation, via its config, at the ZK quorum in your HDP stack. If you install NiFi via the HDF Ambari stack, it has a dependency that forces the installation of ZK in the HDF stack and configures your NiFi service to use it. You can, however, alter the NiFi configs to use your other ZK. If you install HDF NiFi via the command line rather than with Ambari, you can configure it to use your HDP ZK quorum out of the gate. Thanks, Matt
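For illustration only (the hostnames and ports below are placeholders, not values from this thread), pointing NiFi at an existing HDP ZK quorum mostly comes down to the ZK connect string in nifi.properties, plus the matching Connect String in state-management.xml:

```
# nifi.properties -- illustrative values; substitute your own HDP ZK hosts
nifi.zookeeper.connect.string=hdp-zk1.example.com:2181,hdp-zk2.example.com:2181,hdp-zk3.example.com:2181

# conf/state-management.xml -- the zk-provider "Connect String" property should
# point at the same quorum so cluster-wide state is stored there as well.
```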
11-08-2016
04:37 PM
@Sunile Manjee The "nifi.cluster.is.node" parameter specifies whether the NiFi installation is a standalone installation (false) or a node in a NiFi cluster (true). When set to true, things like ZK are required, because cluster-wide state management comes into play and that state is stored in ZK. With true, NiFi also requires a ZK for the NiFi cluster (even if you have only 1 node). The NiFi node will send heartbeats to ZK, and a primary node and cluster coordinator will be elected. By setting it to false, NiFi does not need a ZK for any of the above. State management is only local as well. You will get better performance out of a standalone NiFi (false) than you will out of a 1-node cluster (true with only one node) because you reduce the overhead by not having the ZK piece. Thanks, Matt
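As a rough illustration (not from the original post; the ZK hosts are placeholders), the relevant nifi.properties entries for the two modes look something like this:

```
# Standalone instance -- no ZK needed, state is kept locally
nifi.cluster.is.node=false

# Clustered node (even a 1-node cluster) -- a ZK quorum is required
# nifi.cluster.is.node=true
# nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```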
11-07-2016
01:36 PM
2 Kudos
@Ronak Jangir The HDFS client does not currently support the LzoCodec, and the core-site.xml file you are using includes it. It should work after you remove "com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec" from the "io.compression.codecs" property in the "core-site.xml" file you have referenced in your PutHDFS processor. Thanks, Matt
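As a sketch only (the remaining codec list below is illustrative, not taken from the actual core-site.xml in this thread), the property would end up looking something like this once the LZO entries are dropped:

```
<property>
  <name>io.compression.codecs</name>
  <!-- LZO codecs removed; keep only codecs the HDFS client used by NiFi supports -->
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```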
11-03-2016
12:59 PM
3 Kudos
@Santiago Ciciliani Do you have any idea how many log lines per FlowFile? A suggested dataflow may look like this: The SplitText processor is used to break up your incoming log files into many smaller FlowFiles that can more easily be handled by the RouteText processor without running out of heap memory. This is done by setting the Line Split Count property. How much heap you have configured for your NiFi and the size of each log line determine how many log lines you can have per split FlowFile. The RouteText processor evaluates the entire FlowFile's content and routes groups of log lines to a "dt" relationship. The UpdateAttribute processor (optional) will create a "dt" attribute from the "RouteText.Group" attribute. You can use this attribute later to define the Hive partition table. The MergeContent processor (optional) is used to combine FlowFiles with matching values (dates) in the "RouteText.Group" attribute back into a single FlowFile. Don't forget to set the number of entries and max bin age properties to maximize this processor's usage. Route the "Merged" relationship from this processor to your Hive-based processor. Thanks, Matt
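Purely as an illustration (the regex, counts, and attribute names below are assumptions, not settings from this thread), the key properties for that SplitText -> RouteText -> UpdateAttribute -> MergeContent chain might look like:

```
# SplitText -- keep each split small enough for the configured heap
Line Split Count: 10000

# RouteText -- group lines by their leading date; matched lines go to the "dt" relationship
Grouping Regular Expression: (\d{4}-\d{2}-\d{2}).*
dt (dynamic property): .*

# UpdateAttribute -- expose the group value as a plain attribute (e.g. for a Hive partition)
dt (dynamic property): ${RouteText.Group}

# MergeContent -- recombine splits that share the same date
Correlation Attribute Name: RouteText.Group
Minimum Number of Entries: 1000
Max Bin Age: 5 min
```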
11-02-2016
07:54 PM
1 Kudo
@Paul Yang
1. There is an existing open Jira for being able to adjust the batch size of Site-to-Site (https://issues.apache.org/jira/browse/NIFI-1202).
2. NiFi does not restrict how many RPGs can be added to the canvas. What is important to understand is that NiFi nodes do not know about one another; each runs the dataflow. When using RPGs to pull data from an output port, every node is running that RPG and every node is requesting FlowFiles. When one of those nodes connects, the cluster informs that connecting instance that x number of FlowFiles are currently queued to that output port, and that node will pull them all. So you get much better load-balance behavior from a push to an input port (still done in batches of 100).
3. Two suggestions come to mind: a. Reduce the configured "Partition Size" value in your GenerateTableFetch processor so more FlowFiles are generated, which should then get better load balanced across your nodes. b. Instead of using S2S, build a load-balanced dataflow that is hard-coded to deliver data to each node, as sketched below.
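One common way to wire up that kind of hard-coded distribution (a sketch with placeholder hostnames, not necessarily the exact flow the original post illustrated) is a DistributeLoad processor fanning out to one delivery path per node:

```
# DistributeLoad -- round-robin fan-out, one relationship per node (counts/hosts are illustrative)
Number of Relationships: 3
Distribution Strategy: round robin

# Relationship "1" -> RPG pointing at http://node1.example.com:8080/nifi
# Relationship "2" -> RPG pointing at http://node2.example.com:8080/nifi
# Relationship "3" -> RPG pointing at http://node3.example.com:8080/nifi
# Each RPG sends to the same Input Port at the head of the per-node processing flow.
```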
11-02-2016
06:26 PM
1 Kudo
@apsaltis I might suggest we make a few changes to this article:
1. The link you have for installing HDF talks about installing HDF 2.0. HDF 2.0 is based off Apache NiFi 1.0. Since MiNiFi is built from Apache NiFi 0.6.1, the dataflows built and templated for conversion into MiNiFi YAML files must also be built using an Apache 0.6 based NiFi install. (I see in your example above you did just that, but this needs to be made clear.)
2. I would never recommend setting nifi.remote.input.socket.host= to "localhost". When a NiFi or MiNiFi connects to another NiFi via S2S, the destination NiFi will return the value set for this property along with the value set for nifi.remote.input.socket.port=. In your example that means the source MiNiFi would then try to send FlowFiles to localhost:10000. This is ONLY going to work if the destination NiFi is located on the same server as MiNiFi.
3. You should also explain why you are changing nifi.remote.input.secure= from true to false. Changing this is not a requirement of MiNiFi; it is simply a matter of preference (if set to true, both MiNiFi (source) and NiFi (destination) must be set up to run securely over HTTPS). In your example you are working with HTTP only.
4. While doable, one should never route the "success" relationship from any processor back onto itself. If you have reached the end of your dataflow, you should auto-terminate the "success" relationship.
5. I am not clear what you are telling me to do based on this line under step 5: "Start the From MiNiFi Input Port".
6. When using the GenerateFlowFile processor in an example flow, it is important to recommend that users set a run schedule other than "0 sec". Since MiNiFi is Apache 0.6.1 based, there is no default backpressure on connections, and with a run schedule of "0 sec" it is very likely this processor will produce FlowFiles much faster than they can be sent across S2S. This will eventually fill the hard drive of the system running MiNiFi. An even better recommendation would be to make sure they set backpressure between the GenerateFlowFile processor and the Remote Process Group (RPG). That way, even if someone stops the NiFi and not the MiNiFi, they don't fill their MiNiFi hard drive. Thanks, Matt
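For reference only (the hostname below is a placeholder; port 10000 is the one mentioned above), the nifi.properties entries on the destination NiFi that points 2 and 3 are concerned with look roughly like this:

```
# Destination NiFi (HDF) -- values returned to the source MiNiFi during S2S negotiation
nifi.remote.input.socket.host=nifi-dest.example.com   # must be reachable from MiNiFi, not "localhost"
nifi.remote.input.socket.port=10000
nifi.remote.input.secure=false                         # false for plain HTTP; true requires both sides to be secured
```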
11-02-2016
12:22 PM
1 Kudo
@Obaid Salikeen The hardware requirements as far as number of CPUs and RAM are very dependent upon the nature of the dataflow you design and implement, as well as the throughput you want to achieve. While the "Hardware Sizing Recommendations" you linked is a good starting point, I do believe the memory allocations suggested there are low, but again that is subject to your dataflow design intentions. Some processor components are CPU, disk I/O, or memory intensive, or all of the above. Some of those components may even exhibit different load characteristics depending on how they are configured. Have you considered using the latest HDF 2.0.1? It gets rid of the NCM (master) in your cluster. If not, your NCM for an 8-node cluster will likely need more memory for heap than your nodes. My suggestion would be to do your development work based upon the recommendations above (with additional memory, min 8 GB - 16 GB). After you have a designed and tested flow, you will be able to see the impact on your hardware and adjust accordingly for your testing phase before production. Thanks, Matt
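As an illustration only (the 8 GB figure simply mirrors the minimum suggested above; it is not a recommendation from the sizing guide), the NiFi heap allocation is raised in conf/bootstrap.conf:

```
# conf/bootstrap.conf -- JVM heap settings (illustrative sizes)
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
```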