07-13-2017
02:10 PM
NiFi stores FlowFile attributes in the FlowFile repository and FlowFile content in the Content repository. NiFi knows which queue each FlowFile was in when it is shut down. This allows NiFi to reload those FlowFiles back into the same queues and pick up where the dataflow left off after a restart.
07-13-2017
01:57 PM
2 Kudos
@siva karna I am not following the statement "so there is an abstraction for the first process group flow file it will stop so we will loss the data". Why would stopping a dataflow cause data loss? NiFi only reads new NARs/JARs added to a NiFi lib directory on startup; there is no option to dynamically add classes at runtime. Thanks, Matt
07-13-2017
12:42 PM
1 Kudo
@Akash S The ListHDFS processor records state so that only new files are listed. The processor also has a configuration option for recursing subdirectories. You could set the directory to just /MajorData/Location/ and let it list all files from the subdirectories; as new subdirectories are created, the files within them will get listed.

If that does not work for you, the NiFi Expression Language (EL) statement you are looking for would look something like this for the directory:

/MajorData/Location/${now():format('yyyy/MM/dd')}

The above would cause NiFi to look only in the target directory for files until the day changed. I am not sure at what rate files are written into these target directories, but be mindful that if a file is added between runs of the ListHDFS processor and the day changes between those runs, that file will not get listed using the above EL statement.

Thanks, Matt
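For illustration, here is a minimal Java sketch (not NiFi code, just an equivalent) of the path that EL statement resolves to on a given day:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ElDatePathDemo {
    public static void main(String[] args) {
        // Mirrors what ${now():format('yyyy/MM/dd')} evaluates to in NiFi EL
        String datePart = new SimpleDateFormat("yyyy/MM/dd").format(new Date());
        System.out.println("/MajorData/Location/" + datePart);
        // e.g. /MajorData/Location/2017/07/13
    }
}
```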
07-12-2017
07:17 PM
2 Kudos
@M R I find the following very useful when trying to build Java regular expressions: http://myregexp.com

The Java regular expression:

^(.*?)%%(.*?)%%(.*?)%%(.*?)%%(.*?),(.*?)%%(.*?)$

It has 7 capture groups. When you add a new property to the ExtractText processor with a property name of "string" and use the above Java regex, the capture groups will be written to FlowFile attributes string.1 through string.7. Of course, if you are only looking for two capture groups, you could use the following regex instead:

^(.*?)%%.*?%%(.*?)%%.*?%%.*?,.*?%%.*?$

Thanks, Matt
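If it helps, here is a small runnable Java sketch of how those capture groups pull a string apart (the input line is a hypothetical example in the %%-delimited format from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {
    public static void main(String[] args) {
        // Hypothetical sample input in the %%-delimited format
        String input = "a%%b%%c%%d%%e,f%%g";
        Pattern p = Pattern.compile("^(.*?)%%(.*?)%%(.*?)%%(.*?)%%(.*?),(.*?)%%(.*?)$");
        Matcher m = p.matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                // ExtractText would expose these as attributes string.1 ... string.7
                System.out.println("string." + i + " = " + m.group(i));
            }
        }
    }
}
```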
07-11-2017
03:36 PM
@Eric Lloyd I considered that as well at first, but went the other route since I could be sure my byte sequence would be unique no matter what the stack trace looked like. Since you are looking for a line return followed by 20, you may have an issue with the very first line in your file. I would test that to confirm. Matt
07-11-2017
03:17 PM
@Eric Lloyd Must be a by-product of the SplitContent operation. It is reading the last line return before it sees the next byte sequence. If the blank line becomes an issue, you can also remove blank lines using a ReplaceText processor configured to replace any line that consists of just a line return with nothing (see the sketch below). Thanks, Matt
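A minimal Java sketch of that replacement; the regex here is my own illustrative choice, and the exact ReplaceText settings may differ:

```java
public class BlankLineFilterDemo {
    public static void main(String[] args) {
        String content = "line one\n\nline two\n";
        // Drop any line that is nothing but a line return, the same idea as a
        // ReplaceText search value of ^\n with an empty replacement value
        String cleaned = content.replaceAll("(?m)^\\n", "");
        System.out.print(cleaned); // "line one\nline two\n"
    }
}
```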
07-11-2017
02:31 PM
@Eric Lloyd Another option (not as nice as the GrokReader) is to use the SplitContent processor instead of the SplitText processor. Here I use the ReplaceText processor to match the date string format every log line starts with and prepend to it a unique string that I can use later to split the content (a rough sketch of that replacement follows the example output below). I then use the SplitContent processor to split based on that unique string. This means that any stack trace that follows a log line will be captured with the preceding log entry. After that you can do what you want with the resulting splits; I chose to filter out the splits for ERROR or WARN log lines and auto-terminate everything else. Here is an example output of one of my log lines with a stack trace:

2017-07-11 10:21:38,087 ERROR [Timer-Driven Process Thread-2] o.a.n.p.attributes.UpdateAttribute
java.lang.StringIndexOutOfBoundsException: String index out of range: 40
at java.lang.String.substring(String.java:1963) ~[na:1.8.0_77]
at org.apache.nifi.attribute.expression.language.evaluation.functions.SubstringEvaluator.evaluate(SubstringEvaluator.java:55) ~[nifi-expression-language-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.attribute.expression.language.Query.evaluate(Query.java:570) ~[nifi-expression-language-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.attribute.expression.language.Query.evaluateExpression(Query.java:388) ~[nifi-expression-language-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.attribute.expression.language.StandardPreparedQuery.evaluateExpressions(StandardPreparedQuery.java:48) ~[nifi-expression-language-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.attribute.expression.language.StandardPropertyValue.evaluateAttributeExpressions(StandardPropertyValue.java:152) ~[nifi-expression-language-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.attribute.expression.language.StandardPropertyValue.evaluateAttributeExpressions(StandardPropertyValue.java:133) ~[nifi-expression-language-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.processors.attributes.UpdateAttribute.executeActions(UpdateAttribute.java:496) ~[nifi-update-attribute-processor-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.processors.attributes.UpdateAttribute.onTrigger(UpdateAttribute.java:377) ~[nifi-update-attribute-processor-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27) ~[nifi-api-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1099) [nifi-framework-core-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136) [nifi-framework-core-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47) [nifi-framework-core-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:132) [nifi-framework-core-1.1.0.2.1.4.0-5.jar:1.1.0.2.1.4.0-5]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]

Thanks, Matt
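As promised, a rough Java sketch of the marker idea (the marker string and timestamp regex are my own illustrative choices, not the exact processor settings):

```java
public class LogSplitMarkerDemo {
    public static void main(String[] args) {
        String log = "2017-07-11 10:21:38,087 ERROR something failed\n"
                + "java.lang.RuntimeException: stack trace line\n"
                + "2017-07-11 10:21:39,001 INFO next entry\n";
        // ReplaceText step: match the timestamp that starts every true log line
        // and prepend a unique marker. SplitContent then splits on "###SPLIT###",
        // so stack trace lines stay attached to the log line that precedes them.
        String marked = log.replaceAll(
                "(?m)^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})",
                "###SPLIT###$1");
        System.out.print(marked);
    }
}
```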
07-11-2017
02:03 PM
@adrian white At 90 MB, I suspect that CSV file has a lot of lines to split. Are you seeing any Out Of Memory errors in your nifi-app.log?

To help reduce heap usage here, you may want to try using two SplitText processors in series: the first splitting every 1,000 - 10,000 lines and the second then splitting those results line by line.

NiFi FlowFile attributes are kept in heap memory. NiFi has a mechanism for swapping FlowFile attributes to disk for queues, but this mechanism does not apply to processors. The SplitText processor holds the FlowFile attributes for every new FlowFile it is creating in heap until all resulting split FlowFiles have been created. When splitting creates a huge number of resulting FlowFiles in a single transaction, you can run out of heap space. So by splitting the job between multiple SplitText processors in series, you reduce the number of FlowFiles being generated per transaction, thus decreasing heap usage.

Thanks, Matt
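For a rough sense of scale (assuming, hypothetically, about 1,000,000 lines in that 90 MB file): a single line-by-line SplitText would hold attributes for roughly 1,000,000 FlowFiles in heap for one transaction, while a first-stage split at 10,000 lines produces only about 100 FlowFiles, and each second-stage split then produces just 10,000 FlowFiles per transaction.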
07-10-2017
06:12 PM
Just to add more detail to the above answer...
- Granting users the ability to run provenance queries does not give them the ability to view details on every piece of data that passes through every processor component on the canvas.
- If you were to monitor the nifi-app.log on each of your nodes, you would likely see that the provenance query returns events even though none are being displayed. This is because NiFi filters the results based on the "data" resource policies granted to that user.
- Only results for components to which the user has been granted access will be displayed. This is where the /data/{resource}/{uuid} policy mentioned above comes into play.
07-07-2017
08:22 PM
What is the output of the following:
netstat -ant | grep LISTEN