Member since: 07-30-2019
Posts: 3391
Kudos Received: 1618
Solutions: 1001
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 290 | 11-05-2025 11:01 AM |
| | 175 | 11-05-2025 08:01 AM |
| | 157 | 11-04-2025 10:16 AM |
| | 507 | 10-20-2025 06:29 AM |
| | 647 | 10-10-2025 08:03 AM |
08-16-2017
04:18 PM
1 Kudo
@nesrine salmene The database repository consists of two H2 databases: nifi-user-keys.h2.db and nifi-flow-audit.h2.db. When NiFi is running you will see two additional lock files that correspond to these databases. The nifi-user-keys.h2.db is only used when NiFi has been secured, and it contains information about who has logged in to NiFi. The same information is also output to the nifi-user.log, so you can parse the nifi-user.log to audit who has logged in to a particular NiFi instance. The nifi-flow-audit.h2.db is used by NiFi to keep track of all configuration changes made within the NiFi UI. The information contained in this DB is viewable via the "Flow Configuration History" embedded UI, found under the hamburger menu in the upper-right corner of NiFi's UI. You can also use NiFi's REST API to query the Flow Configuration History. Thanks, Matt
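For example, the history endpoint can be queried with a short script. This is a minimal sketch assuming an unsecured NiFi on localhost:8080 (a secured instance would also need a token or client certificate); the response field names follow the 1.x REST API, so verify them against your version:

```python
import requests

# Query the Flow Configuration History (the same data shown in the UI).
resp = requests.get(
    "http://localhost:8080/nifi-api/flow/history",
    params={"offset": 0, "count": 25},  # page through results with offset/count
)
resp.raise_for_status()

# Each action records one configuration change: who made it, when, and to what.
for entity in resp.json()["history"]["actions"]:
    action = entity["action"]
    print(action["timestamp"], action["userIdentity"],
          action["operation"], action.get("sourceName"))
```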
08-16-2017
01:46 PM
@Wesley Bohannon Is this a NiFi standalone or a NiFi cluster? If a cluster, are the FlowFiles produced by each of your SelectHiveQL processors all being produced on the same node? The MergeContent processor will not merge FlowFiles from different cluster nodes. Assuming that all FlowFiles are on the same NiFi instance, the only ways I could reproduce your scenario were:

1. Each FlowFile had a different value assigned to the "table_name" FlowFile attribute and Merge Strategy was set to "Bin-Packing Algorithm". This caused each FlowFile to be placed in its own bin, and at the end of the 5 minute max bin age each bin of one was merged. (If the intent is always to merge one FlowFile from each incoming connection, what is the purpose of setting a "Correlation Attribute Name"?)
2. Maximum number of bins was set to 1 and the 4 source FlowFiles became queued at different times.

The "Defragment" Merge Strategy bins FlowFiles based on matching values in the "fragment.identifier" FlowFile attribute and then merges the FlowFiles using the "fragment.index" and "fragment.count" attributes. Since you have also specified a correlation attribute, the MergeContent processor will use the value associated with that attribute instead of "fragment.identifier" to bin your files. If each FlowFile has a unique value for "table_name", then each FlowFile ends up in a different bin and is routed to failure right away (if bins are set to 1) or after the 5 minute max bin age, since not all fragments were present. The other possibility is that "fragment.count" and "fragment.index" are set to 1 on every FlowFile. I would stop your MergeContent processor and allow 1 FlowFile to queue in each connection feeding it, then use the "list queue" capability to inspect the attributes on each queued FlowFile (a scripted way to do the same inspection is sketched below). What values are associated with each FlowFile for the following attributes: fragment.identifier, fragment.count, fragment.index, table_name? Thank you, Matt
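The same inspection can be done against the REST API. A hedged sketch, assuming a standalone, unsecured NiFi on localhost:8080; QUEUE_ID is a hypothetical placeholder for the UUID of the connection feeding MergeContent (a clustered node would also need a clusterNodeId parameter):

```python
import time

import requests

BASE = "http://localhost:8080/nifi-api"
QUEUE_ID = "your-connection-uuid"  # hypothetical placeholder

# Start a listing request for the queue, then poll until it finishes.
listing = requests.post(f"{BASE}/flowfile-queues/{QUEUE_ID}/listing-requests").json()
req_id = listing["listingRequest"]["id"]
while not listing["listingRequest"]["finished"]:
    time.sleep(1)
    listing = requests.get(
        f"{BASE}/flowfile-queues/{QUEUE_ID}/listing-requests/{req_id}").json()

# Fetch each queued FlowFile's full attribute map and print the merge-related keys.
for summary in listing["listingRequest"]["flowFileSummaries"]:
    ff = requests.get(
        f"{BASE}/flowfile-queues/{QUEUE_ID}/flowfiles/{summary['uuid']}").json()
    attrs = ff["flowFile"]["attributes"]
    print({k: attrs.get(k) for k in
           ("fragment.identifier", "fragment.count", "fragment.index", "table_name")})

# Clean up the listing request when done.
requests.delete(f"{BASE}/flowfile-queues/{QUEUE_ID}/listing-requests/{req_id}")
```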
08-16-2017
01:08 PM
2 Kudos
@Pierre Leroy Splitting such a large file may result in Out Of Memory (OOM) errors in NiFi. NiFi must create every split FlowFile before committing those splits to the "splits" relationship, and during that process NiFi holds the FlowFile attributes (metadata) of all the FlowFiles being produced in heap memory. What your image above shows is that you issued a stop on the processor. This means you have stopped the processor's scheduler from triggering again, but the processor will still allow any existing running threads to complete. The small number "2" in the upper right corner indicates the number of threads still active on this processor. If you have run out of memory, for example, this process will probably never complete; a restart of NiFi will kill off these threads. When splitting very large files, it is common practice to use multiple SplitText processors in series with one another. The first SplitText is configured to split the incoming files into large chunks (say every 10,000 to 20,000 lines), and the second SplitText processor then splits those chunks into the final desired size. This greatly reduces the heap memory footprint, as the rough arithmetic below illustrates. Thanks, Matt
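A back-of-the-envelope sketch of why the two-stage approach helps (the 8-million-line file and 1-line final splits are hypothetical numbers for illustration, not from the question):

```python
# Peak number of split FlowFiles whose metadata must sit in heap at once.
total_lines = 8_000_000   # hypothetical source file
chunk_lines = 10_000      # intermediate chunk size for the first SplitText
final_lines = 1           # lines per final split

# Single SplitText: every final split exists before the session commits.
single_stage_peak = total_lines // final_lines        # 8,000,000 FlowFiles

# Two SplitText processors in series: far smaller peaks at each stage.
first_stage_peak = total_lines // chunk_lines         # 800 chunk FlowFiles
second_stage_peak = chunk_lines // final_lines        # 10,000 per chunk

print(single_stage_peak, first_stage_peak, second_stage_peak)
```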
08-15-2017
06:14 PM
1 Kudo
@Hadoop User It is unlikely you will see the same performance out of Hadoop between reads and writes. The Hadoop architecture is designed in such a way as to favor many data readers and few data writers. Increasing the number of concurrent tasks may help performance, since you will then have multiple files being written concurrently. 1 - 2 KB files are very small and do not make optimal use of your Hadoop architecture. Commonly, NiFi is used to merge bundles of files together into a more optimal size for storage in Hadoop; I believe 64 MB (the default HDFS block size) is the optimal target. You can remove some of the overhead of each connection by merging files together into larger files using the MergeContent processor before writing to Hadoop, as the quick calculation below shows. Thanks, Matt
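For a sense of scale, a quick calculation with the numbers above (the 64 MB target is the assumption just mentioned):

```python
# How many 1-2 KB files fit into one 64 MB bundle.
target_bytes = 64 * 1024 * 1024
for size_kb in (1, 2):
    files_per_bundle = target_bytes // (size_kb * 1024)
    print(f"{size_kb} KB files: ~{files_per_bundle:,} files per 64 MB bundle")
```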
08-15-2017
06:01 PM
@Wesley Bohannon The issue you are most likely running into is caused by only having 1 bin. https://issues.apache.org/jira/browse/NIFI-4299 Change the number of bins to at least 2 and see if that resolves your issue. Thanks, Matt
08-15-2017
02:39 PM
@Hadoop User I am unfortunately not a Hive or Hadoop Guru. Both errors above are being thrown by the Hive and Hadoop client libraries that these processors use and not by NiFi itself. Hopefully the above log lines are followed by full stack traces in the nifi-app.log. If not, try enabling DEBUG logging (NiFi's logging is configured in conf/logback.xml) to see if you can get a stack trace output. That stack trace may provide the necessary details to help diagnose what is causing the issue, and hopefully a Hive or Hadoop Guru will then be able to provide some assistance here. I also suggest providing the details of the HiveConnectionPool controller service you set up that is being used by this PutHiveQL processor. Thanks, Matt
08-15-2017
02:01 PM
1 Kudo
@Timothy Spann I am not sure what you mean by "Not updating". Also, what Provenance implementation are you using (Persistent, WriteAhead, or Volatile)?
08-15-2017
01:57 PM
@Hadoop User Please share your PutHDFS processor configuration with us. How large are the individual files that are being written to HDFS? Thanks, Matt
08-04-2017
03:17 PM
1 Kudo
@J. D. Bacolod Those processors were added for specific use cases such as yours. You can almost accomplish the same thing using the PutDistributedMapCache and FetchDistributedMapCache processors along with an UpdateAttribute processor. I used the UpdateAttribute processor to set a unique value in a new attribute named "release-value". The FetchDistributedMapCache processor then acts as the Wait processor did, looping FlowFiles in the "not-found" relationship until the corresponding value is found in the cache. The "release-value" is written to the cache using the PutDistributedMapCache processor down the other path, after the InvokeHTTP processor; it receives the "Response" relationship.

Keep in mind, the FetchDistributedMapCache processor does not have an "expire" relationship. If a response is never received for some FlowFile, or the cache expired/evicted the needed value, those FlowFiles will loop forever. You can solve this two ways:

1. Set FlowFile Expiration on the connection containing the "not-found" relationship; this will purge files that have not found a matching key value in the cache by the time the FlowFile's age has reached x value. With this option aged data is simply lost.
2. Build a FlowFile expire loop which kicks these looping not-found FlowFiles out of the loop after x amount of time so they can be handled by other processors. This can be done using the "Advanced" UI of an UpdateAttribute processor and a RouteOnAttribute processor. The UpdateAttribute sets a new attribute I called "initial-date" if and only if it has not already been set on the FlowFile. The RouteOnAttribute processor then compares the current time against that attribute's value to see if the file has been looping for more than x amount of time. Using 6 minutes (360000 ms) as an example, my RouteOnAttribute would have a routing rule like the one sketched below; FlowFiles that have been looping for 360000 milliseconds or more will then get routed to the "expired" relationship, where you can choose what you want to do with them.

As you can see, the new Wait and Notify processors wrap the above flow up in only two processors versus the 5 processors you would need in older versions to get the same functionality. Thanks, Matt
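For illustration, here is one hypothetical way those two pieces could be expressed in NiFi Expression Language (assuming "initial-date" is stored as epoch milliseconds; the exact expressions below are my sketch, not taken from the original screenshots):

- UpdateAttribute (Advanced UI) rule condition: `${initial-date:isEmpty()}`, with an action that sets initial-date to `${now():toNumber()}`
- RouteOnAttribute routing property "expired": `${now():toNumber():minus(${initial-date}):ge(360000)}`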
08-03-2017
01:21 PM
2 Kudos
@J. D. Bacolod The use case you describe is an exact fit for the "Wait" and "Notify" processors introduced in HDF 3.0 / Apache NiFi 1.2.0. Using these processors, it would work as follows: The input (original FlowFile) is routed to both a Wait processor and your existing flow. The "Response" relationship from your InvokeHTTP processor would route to the corresponding Notify processor. The copy of the FlowFile that was routed to the Wait processor will continuously loop in the "wait" relationship until a release signal identifier for the FlowFile is written to a DistributedMapCache service by the Notify processor. Thanks, Matt
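As a hypothetical example of the release signal: both the Wait and Notify processors could be configured with a Release Signal Identifier of `${filename}` (assuming filenames are unique per request) and pointed at the same DistributedMapCacheClientService, so the copy held by Wait is released as soon as Notify records the matching response.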