Member since: 07-30-2019
Posts: 3146
Kudos Received: 1566
Solutions: 911
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 67 | 01-21-2025 06:16 AM |
| | 120 | 01-16-2025 03:22 PM |
| | 214 | 01-09-2025 11:14 AM |
| | 1328 | 01-03-2025 05:59 AM |
| | 526 | 12-13-2024 10:58 AM |
09-29-2016
04:18 PM
@Breandán Mac Parland
If you are looking for a way to generate a file with the name "_SUCCESS", you can use the GenerateFlowFile processor to generate a file with random data as its content. You can then use an UpdateAttribute processor to set the filename to "_SUCCESS" by adding a new property with a property name of "filename" and a value of "_SUCCESS". Matt
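A minimal sketch of that two-processor flow (the "File Size = 0B" setting is an assumption, to give the marker file empty content rather than random data):

```
GenerateFlowFile (File Size = 0B)
        │
        ▼
UpdateAttribute (add custom property:  filename = _SUCCESS)
        │
        ▼
(your existing Put processor writes the empty "_SUCCESS" marker)
```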
09-27-2016
02:16 PM
2 Kudos
@Parag Garg NiFi can certainly handle dataflows well in excess of 123 processors and well in excess of the number of FlowFiles you have here. Different processors put different resource (CPU, memory, and disk I/O) strain on your hardware.

In addition to processors having an impact on memory, so do FlowFiles themselves. FlowFiles are a combination of the physical content (stored in the NiFi content repository) and FlowFile attributes (metadata associated with the content, stored in heap memory). You can experience heap memory issues if your FlowFiles have very large attribute maps (for example, when extracting large amounts of content into attributes).

The first step is identifying which processor(s) in your flow are memory intensive and are causing your OutOfMemoryError. Processors such as SplitText, SplitXml, and MergeContent can use a lot of heap if they are producing a lot of split files from a single file or merging a large number of files into a single file. The reason is that the merging and splitting happens in memory until the resulting FlowFile(s) are committed to the output relationship.

There are ways of handling this resource exhaustion through dataflow design: for example, merging a smaller number of files multiple times (using multiple MergeContent processors) to produce that one large file, or splitting files in multiple stages (using multiple Split processors), as sketched below. Also be mindful of the number of concurrent tasks assigned to these memory-intensive processors.

Running with 4 GB of heap is good, but depending on your dataflow, you may find yourself needing 8 GB or more of heap to satisfy the demand created by your dataflow design. Thanks, Matt
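As a rough illustration of the multi-stage approach (the property names come from SplitText and MergeContent; the specific counts are assumptions you would tune to your data):

```
# Two-stage split: no single task holds millions of splits in heap at once
SplitText (stage 1):    Line Split Count = 10000
SplitText (stage 2):    Line Split Count = 1

# Two-stage merge: rebuild one large file from bundles of at most 10,000 entries
MergeContent (stage 1): Maximum Number of Entries = 10000
MergeContent (stage 2): Maximum Number of Entries = 10000   # merges the stage-1 bundles
```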
09-26-2016
11:35 AM
@Felix Duterloo The Read/Write stats on the face of each processor tell you how much content is being read from or written to NiFi's content repository. They are not intended to tell you how much data is written to the next processor or to some external system; their purpose is to help dataflow managers understand which processors in their dataflows are disk I/O intensive. The "Out" stat tells you how many FlowFiles were routed out of this processor to one or more output relationships. In the case of a processor like PutHDFS, it is typical to auto-terminate the "success" relationship and loop the "failure" relationship back onto the PutHDFS processor itself. Any FlowFile routed to "success" has confirmed delivery to HDFS. FlowFiles routed to "failure" could not be delivered and should produce a bulletin and log messages explaining why. If the failure relationship is looped back on your PutHDFS, NiFi will try again to deliver the file after the FlowFile's penalty has expired. Matt
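That connection layout looks roughly like this (the 30 sec figure is NiFi's default Penalty Duration, adjustable on the processor's Settings tab):

```
PutHDFS
 ├─ success → auto-terminated (delivery to HDFS confirmed)
 └─ failure → looped back into PutHDFS (retried once the penalty, default 30 sec, expires)
```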
09-22-2016
11:20 AM
@mayki wogno When asking new questions unrelated to the current thread, please start a new Community Connection question. This benefits the community at large who may be searching for answers to the same question.
09-21-2016
05:29 PM
1 Kudo
There is the possibility that the time could differ slightly (by a few ms) between the two now() calls in that Expression Language statement, which could cause the result to push back to 11:58:59. To avoid this you can simply reduce 43260000 by a few milliseconds (for example, to 43259990) to ensure that does not happen, so 11:59:00 is always returned.
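Applying that adjustment to the expression from the answer just below, only the constant changes:

```
${now():minus(${now():mod(86400000)}):minus(43259990):format('MM-dd-yyyy hh:mm:ss')}
```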
09-21-2016
04:34 PM
2 Kudos
@Sree Venkata You can do this using a combination of NiFi Expression Language (EL) functions:

${now():minus(${now():mod(86400000)}):minus(43260000):format('MM-dd-yyyy hh:mm:ss')}

This EL statement takes now(), subtracts the remainder of dividing now() by 86400000 (the number of milliseconds in 24 hours) to land on midnight of the current day, then subtracts an additional 43260000 ms (12 hours and 1 minute) from that result, and finally formats the output in the date format you are looking for. I confirmed this EL statement by using it in an UpdateAttribute processor; looking at the attributes on a FlowFile it processed, the "yesterday" attribute shows the previous day's date with a time of exactly 11:59:00. Thanks, Matt
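A worked example with an assumed timestamp: if now() returned 1474562040000 (09-22-2016 16:34:00 UTC), the arithmetic would be:

```
1474562040000 mod 86400000 =      59640000   # ms elapsed since midnight UTC
1474562040000 - 59640000   = 1474502400000   # 09-22-2016 00:00:00 (midnight today)
1474502400000 - 43260000   = 1474459140000   # 09-21-2016 11:59:00 (yesterday)
```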
09-21-2016
03:53 PM
1 Kudo
@mayki wogno Nodes in a NiFi cluster do not share data. Each node works only on the specific data it has received through some ingest-type NiFi processor. As such, each node has its own repositories for storing that data (FlowFile content) and the metadata (FlowFile attributes) about that node-specific data. Every node in the cluster loads and runs the exact same dataflow.

One and only one node in the cluster can be the "primary node" at any given time. Some NiFi processors are not cluster friendly and should only run on one node in the cluster at any given time (GetSFTP is a good example). NiFi allows you to configure those processors with an "on primary node" scheduling strategy. While these processors will still exist on every node in the cluster, they will only run on the primary node. If the primary node designation in the cluster should change at any time, the cluster takes care of stopping the "on primary node" scheduled processors on the original primary node and starting them on the new primary node.

When a node goes down, the other nodes in the cluster will not pick up the data that was queued on that down node. As Bryan pointed out, that node will pick up where it left off on its queued data once restored to an operational state, provided there was no loss/corruption of either the content or FlowFile repositories on that specific node. Thanks, Matt
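In the processor's scheduling configuration this looks roughly as follows (exact field names vary a little between NiFi versions; this sketch assumes the 1.x "Execution" setting):

```
GetSFTP → Configure → Scheduling tab
  Scheduling Strategy : Timer driven
  Execution           : Primary node   # runs only on whichever node currently holds the primary role
```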
09-21-2016
03:39 PM
If the node that goes down happens to be the "primary node", the cluster will automatically elect a new "primary node" from the remaining available nodes and start those "primary node only" processors.
09-21-2016
11:51 AM
1 Kudo
@Gerd Koenig After a closer look at the jaas file you posted above, I believe your issue is a missing " in the following line:

principal="nifi@REALM;

This line should actually be:

principal="nifi@REALM";

Try making the above change and restarting your NiFi. Thanks, Matt
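For reference, a complete JAAS entry for NiFi's Kerberos login might look like the sketch below; the entry name, keytab path, and realm here are assumptions to adjust for your environment:

```
// Hypothetical nifi-jaas.conf entry; names and paths are illustrative
nifi {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/nifi.keytab"
    principal="nifi@EXAMPLE.COM";   // the closing quote must come before the semicolon
};
```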