Member since: 11-16-2015
Posts: 905
Kudos Received: 665
Solutions: 249

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 435 | 09-30-2025 05:23 AM |
| | 771 | 06-26-2025 01:21 PM |
| | 667 | 06-19-2025 02:48 PM |
| | 857 | 05-30-2025 01:53 PM |
| | 11405 | 02-22-2024 12:38 PM |
12-15-2016
04:30 PM
4 Kudos
Currently the remove() functionality is used by particular processors such as DetectDuplicate, GetHBase, etc., and is not exposed via a RemoveDistributedMapCache processor or anything like that. I have written an article that provides and describes a Groovy script I wrote to interact with the DistributedMapCacheServer from the command line; does that suit your needs?
12-15-2016
04:29 PM
13 Kudos
In NiFi/HDF, it is possible to create a kind of lookup table of key/value pairs using a DistributedMapCacheServer. The DistributedMapCacheServer is used by various processors such as GetHBase and DetectDuplicate. It can also be leveraged by users in their flows, via the PutDistributedMapCache and FetchDistributedMapCache processors, by specifying a corresponding DistributedMapCacheClientService. Sometimes, however, the user might like to interact with the DistributedMapCacheServer programmatically (and external to NiFi), say for removing specific entries, inserting/populating entries, etc. To that end I have written a Groovy script (dcachegroovy.txt, rename to dcache.groovy) that allows manipulation of the entries in a DistributedMapCacheServer from the command line.

The usage is as follows:

Usage: groovy dcache.groovy <hostname> <port> <command> <args>

Where <command> is one of the following:

get: Retrieves the values for the keys (provided as arguments)
remove: Removes the keys (specified as arguments)
put: Sets the given keys to the given values, specified as arguments in the form: key1 value1 key2 value2 ... keyN valueN

So to insert the entries "a = Hello" and "b = World" (assuming a local DistributedMapCacheServer at the default port), you can enter:

groovy dcache.groovy localhost 4557 put a Hello b World

which outputs the following:

Set a = Hello
Set b = World

Then to retrieve the values:

groovy dcache.groovy localhost 4557 get a b

which gives:

a = Hello
b = World

To remove an entry:

groovy dcache.groovy localhost 4557 remove b

giving:

Removed b

Trying the get again for both keys (where b no longer exists):

groovy dcache.groovy localhost 4557 get a b

gives:

a = Hello
b =

This script can be used to pre-populate, clear, or inspect a DistributedMapCacheServer. I'd be interested to hear whether you try it and find it useful, and of course all suggestions for improvements are welcome. Cheers!
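For reference, here is a minimal sketch of the script's overall shape, covering only the argument parsing and command dispatch. The cachePut/cacheGet/cacheRemove closures are hypothetical stand-ins for the socket-protocol code in the attached script; they just use an in-memory map so the sketch runs standalone:

```groovy
// Minimal sketch of the script's argument parsing and command dispatch.
// The three closures below are hypothetical stand-ins for the real
// wire-protocol code in the attached dcache.groovy; here they operate
// on an in-memory map so this sketch is self-contained.
def cache = [:]
def cachePut    = { String key, String value -> cache[key] = value }
def cacheGet    = { String key -> cache[key] ?: '' }
def cacheRemove = { String key -> cache.remove(key) }

def usage = 'Usage: groovy dcache.groovy <hostname> <port> <command> <args>'
if (args.length < 4) {
    println usage
    System.exit(1)
}
// host and port are parsed but unused by the in-memory stand-ins above
def (host, port, command) = [args[0], args[1] as int, args[2]]
def keysAndValues = args[3..-1]

switch (command) {
    case 'put':
        // put expects key/value pairs: key1 value1 key2 value2 ...
        keysAndValues.collate(2).each { pair ->
            def (key, value) = pair
            cachePut(key, value)
            println "Set $key = $value"
        }
        break
    case 'get':
        keysAndValues.each { key -> println "$key = ${cacheGet(key)}" }
        break
    case 'remove':
        keysAndValues.each { key -> cacheRemove(key); println "Removed $key" }
        break
    default:
        println usage
}
```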
12-14-2016
04:52 PM
1 Kudo
The approach in the other thread is very inefficient for this use case: you're basically trying to do a join between rows in a file and rows in a DB table. An alternative is to populate a DistributedMapCacheServer from the DB table, then look up those values in a separate flow.

To populate the map, you could do something like this: use QueryDatabaseTable with a Max Value Column of "id", so that the map will only be populated once. But if you are adding entries to the lookup table (as it appears you might be, from your description), or if new entries will not have strictly greater values for "id", then you can remove the Max Value Column property and schedule QueryDatabaseTable to run as often as you'd like to refresh the values.

Once this flow is running, you can start a different flow that is similar to the one in the other thread, but instead of querying the DB for each row in the file, it will fetch from the DistributedMapCacheServer, which is hopefully faster. The first part is the same as the flow in the other thread, but instead of using ReplaceText to generate SQL to execute, the value is simply looked up from the map and put into an attribute; then the final ReplaceText is like the one in the other thread, specifying "${column.1},${column.2},${column.3},${column.4},${customer.name}" or whatever the appropriate attributes are. I have attached a template (databaselookupexample.xml) showing these two flows.
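As a rough sketch, the lookup step might be configured along these lines. The property names are from NiFi 1.x's FetchDistributedMapCache, and the attribute names assume the ones from the other thread (the fourth CSV column holds the lookup key, and the result goes into a "customer.name" attribute):

```
FetchDistributedMapCache
  Cache Entry Identifier        ${column.4}
  Distributed Cache Service     DistributedMapCacheClientService
  Put Cache Value In Attribute  customer.name
```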
12-14-2016
02:47 PM
Yes, you can add a dynamic property whose value is a regular expression (see the documentation for more details).
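For example, assuming the processor in question is ExtractText (which takes its regular expressions as dynamic properties), adding a property named order.id (a hypothetical name) with a value like the one below would produce an attribute order.id.1 containing the first capture group:

```
ExtractText
  order.id  =>  OrderId=(\d+)
```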
12-14-2016
02:45 PM
The UI uses the REST API, so you can do the same thing programmatically: POST to /flowfile-queues/{id}/drop-requests to create a drop request for that queue.
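Here is a minimal Groovy sketch that issues such a drop request, assuming an unsecured NiFi instance at localhost:8080 (a secured instance would also need an Authorization header); CONNECTION_ID is a placeholder for the UUID of the connection whose queue you want to empty:

```groovy
// Create a drop request for a connection's queue via the NiFi REST API.
// Assumes an unsecured instance at localhost:8080; CONNECTION_ID is a
// placeholder for the connection's UUID.
def connectionId = 'CONNECTION_ID'
def url = new URL("http://localhost:8080/nifi-api/flowfile-queues/${connectionId}/drop-requests")
def conn = url.openConnection()
conn.requestMethod = 'POST'
println "HTTP ${conn.responseCode}"
println conn.inputStream.text  // JSON describing the new drop request, including its id
```

The returned JSON includes a drop-request id that you can poll (GET) or clean up (DELETE) at /flowfile-queues/{id}/drop-requests/{drop-request-id}.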
12-13-2016
02:44 PM
Yes, you can use something like the regex from step 2 above in a RouteOnContent processor; alternatively, after the ExtractText (step 2 above), you can use RouteOnAttribute to look for values of column.2.
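For the RouteOnAttribute option, each dynamic property defines a route; for example, a property named matched (a hypothetical route name) could use an Expression Language condition like:

```
RouteOnAttribute
  matched  =>  ${column.2:equals('some value')}
```

FlowFiles whose column.2 attribute equals 'some value' would then be routed to the "matched" relationship.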
12-08-2016
06:50 PM
The documents at the link above are for Apache NiFi 1.1.0, but HDF 2.0.0 was built with NiFi 1.0.0. The ability to append was added in NiFi 1.1.0 under NIFI-1322, so it will likely be available in an upcoming version of HDF. The docs at that site are always for the latest version of Apache NiFi; it is recommended that you use the docs that come with your version of HDF/NiFi, via the Help option in the top-right hamburger menu of your running instance.
12-08-2016
05:49 PM
Yeah, we should probably trim that URL before using it; please feel free to file a Jira for that if you like.
12-08-2016
05:39 PM
1 Kudo
So if the customer_name value for id=CCCDD was "Matt", then you'd like the first output row to read:

XXXXX, BBBBB, CCCCC, CCCDD, Matt

Is that correct? If so, you could do the following:

1. Use SplitText to split the incoming CSV into one flow file per line.
2. Use ExtractText to store the four column values as attributes (there is an example template called Working_With_CSV here); let's assume the attribute for the fourth column is called "column.4". A sample configuration is sketched after this list.
3. Use ReplaceText to set the content of the flow file to a SQL statement such as "select customer_name from table where id=${column.4} limit 1".
4. Use ExecuteSQL to execute the statement.
5. Use ConvertAvroToJSON to get the record into JSON (for further processing).
6. Use EvaluateJsonPath to get the value of customer_name into an attribute (named "customer.name", with a JSON Path of $[0].customer_name or something like that).
7. Use ReplaceText to set the row back to the original columns plus the new one, with something like "${column.1},${column.2},${column.3},${column.4},${customer.name}".
8. (Optional) Use MergeContent to join the rows back together (if you need them as one file).
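For step 2, the ExtractText configuration might look like the following; the dynamic property name "column" is arbitrary, and each capture group becomes an attribute column.1 through column.4:

```
ExtractText
  column  =>  ^([^,]*),\s*([^,]*),\s*([^,]*),\s*([^,]*)$
```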
12-08-2016
05:01 PM
Looks like your connect URL has a space as the first character? For the second URL, that might be one of two things. The first (less likely) is that you want your client to use hostnames to resolve name nodes (see my answer here). However, I would've expected an error message like the one in that question, not the one you're seeing.
I think the problem with your second URL is that Apache NiFi (specifically the Hive processors) doesn't necessarily work with HDP 2.5 out of the box: Apache NiFi ships with stock Apache components (such as Hive 1.2.1), whereas HDP 2.5 includes slightly different versions. I would try Hortonworks DataFlow (HDF) rather than Apache NiFi, as the former ships with HDP-compatible versions of the Hive components.