Member since: 11-16-2015
Posts: 892
Kudos Received: 649
Solutions: 245
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5210 | 02-22-2024 12:38 PM |
 | 1337 | 02-02-2023 07:07 AM |
 | 3004 | 12-07-2021 09:19 AM |
 | 4155 | 03-20-2020 12:34 PM |
 | 13951 | 01-27-2020 07:57 AM |
07-19-2016
04:54 PM
Hans, there is an email thread that describes how to (currently) get EvaluateXPath to work with namespaces. The email focuses on default namespaces, but the technique works for explicit namespaces too. It can cause problems, however, if multiple namespaces contain elements with the same local name. There is a Jira case to add namespace support to the XPath/XQuery processors: https://issues.apache.org/jira/browse/NIFI-1023
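A common workaround in this vein (my sketch, not quoted from the email thread; element names are placeholders) is to match on local names so the XPath ignores namespace URIs entirely:

```
//*[local-name()='book']/*[local-name()='title']/text()
```

This selects the text of any "title" child of any "book" element regardless of its namespace, which is also why elements that share a local name across namespaces become ambiguous.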
07-18-2016
02:40 AM
The error in the bulletin is unfortunately not very descriptive; can you check the logs for the cause of that exception?
07-17-2016
10:21 PM
Here's my answer from StackOverflow: There is a subproject of Apache NiFi called MiNiFi, which (among other things) aims to put agents on devices and similar hardware in order to collect data at its point of creation. This will include native agents, so a JVM will not be required. The proposed roadmap is here; it mentions the development of native agent(s).
07-15-2016
02:18 PM
2 Kudos
You could use GetFile -> SplitText -> ExtractText -> InvokeHttp:

- GetFile gets the configuration file; set "Keep source file" to true and schedule it to run once a day
- SplitText splits the file into multiple flow files, each containing a single line/URL
- ExtractText can put the contents of the flow file into an attribute (called "my.url", for example)
- InvokeHttp can be configured to use an Expression Language construct for the URL property (such as "${my.url}"); see the sketch below
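A sketch of the ExtractText/InvokeHttp configuration, assuming ExtractText's standard dynamic-property convention (the property name becomes the attribute name, and the first capture group is placed into it):

```
ExtractText (add a dynamic property)
  my.url = (?s)(^.*$)      # capture the entire flow file content, i.e. the single URL

InvokeHttp
  Remote URL = ${my.url}   # Expression Language pulls the URL from the attribute
```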
07-14-2016
04:47 PM
1 Kudo
QueryDatabaseTable would require a "last modified" column in the table(s) in order to detect updates, and probably a "logical delete" flag (i.e. a boolean column) in that (or a helper) table in order to detect deletes; a sketch of the resulting query appears below. This is similar to what Apache Sqoop does. If you have the Enterprise Edition of SQL Server, you may be able to enable its Change Data Capture feature. Then, for incremental changes, you can use QueryDatabaseTable against the "CDC table" rather than your source tables.

For strict migration (no incremental fetching of updates) of multiple tables in SQL Server, if you can generate individual flow files, each containing an attribute such as "table.name", then you could parallelize across a NiFi cluster by sending them to ExecuteSQL with the query set to "SELECT * FROM ${table.name}". In this case each instance of ExecuteSQL will get all the rows from one table into an Avro record and send it along the flow.

Regarding MongoDB, I don't believe the MongoDB processors support incremental fetching. QueryDatabaseTable might work on flat documents, but there is a bug that prevents nested fields from being returned, and aliasing the columns won't work for the incremental-fetch part. However, ExecuteSQL will work if you explicitly list (and alias) the document fields in the SQL statement, though that won't do incremental fetch either. You might be able to use Sqoop for such things, but there are additional requirements if using sqoop-import-all-tables, and for incremental fetch you'd need 250 calls to sqoop import.

Do your tables all have a "last modified" column or some similar structure? Supporting distributed incremental fetch for arbitrary tables is a difficult problem, as you'd need to know the appropriate "last modified" column for each table (if they're not named the same and/or present in every table). When all tables behave the same way from an update perspective, the problem becomes much easier.
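To make the "last modified" idea concrete, this is roughly the query shape QueryDatabaseTable issues on each run (a sketch; the table and column names are invented for illustration):

```sql
-- Incremental fetch: only rows whose tracked column advanced since the last run
SELECT id, name, last_modified, is_deleted
FROM customers
WHERE last_modified > ?  -- bound to the maximum value seen on the previous run
```

A hard DELETE leaves no row for such a query to find, which is why a logical-delete flag like is_deleted is needed to propagate deletions.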
07-07-2016
03:36 PM
1 Kudo
In EvaluateJsonPath, you can choose "flowfile-attribute" as the Destination; then the original JSON will still be in the flow file content, and any extracted JSON elements will be in the flow file's attributes. That can go into RouteOnAttribute for "eventname". Then you can use ReplaceText (or ExecuteScript if you prefer) to create a CQL statement, either using Expression Language to insert the values from your attributes or wrapping the entire JSON object in a CQL statement. I have a template that uses ReplaceText to put an entire JSON object into an "INSERT INTO myTable JSON" CQL statement; it is available as a Gist (here). It doesn't have a PutCassandraQL processor at the end; instead it's a LogAttribute processor, so you can see if the CQL looks right for what you're trying to do.
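For reference, the CQL JSON-insert form (available since Cassandra 2.2) looks like the following; the table name and fields are placeholders rather than the template's actual schema:

```sql
INSERT INTO myTable JSON '{"eventname": "click", "userid": 42}';
```

In ReplaceText, one way to build this is to capture the whole flow file content with a Search Value like (?s)(^.*$) and use a Replacement Value of INSERT INTO myTable JSON '$1'.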
06-30-2016
07:57 PM
4 Kudos
In curl, -d means to put the data into the request body. The InvokeHttp processor does not send the contents of the flow file for GET requests, only PUT or POST. However, the Elasticsearch Search/Query API accepts GET, so this approach probably won't work. What you may be looking for is the URL Search API; I commented on that in another thread, but will post here too. Using this method, you can put your query in the URL itself. Note that the query parameters look a bit different because they're not JSON; they are HTTP query parameters. In your example you are matching all documents (which is the default, I believe), so http://localhost:9200/tweet_library/_search?size=10000 should be all you need for that case. To explicitly match all documents, you can use the q parameter: http://localhost:9200/tweet_library/_search?size=10000&q=*:* There are quite a few query options available with the URL Search API; please see the Elasticsearch documentation for more information.
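As a sketch, the same match-all search expressed as a URL-only request (no body), using the index from your example:

```bash
# Quotes keep the shell from interpreting & and *
curl "http://localhost:9200/tweet_library/_search?size=10000&q=*:*"
```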
06-29-2016
05:53 PM
1 Kudo
Your curl command uses the -d parameter, which means it's sending that JSON in the body of the request. To do that with InvokeHttp, you could have GenerateFlowFile -> ReplaceText processors before the InvokeHttp, where ReplaceText sets the body to the query you have above. Alternatively, you could use the URL Search API for Elasticsearch. In your example you are matching all documents (which is the default, I believe), so http://localhost:9200/tweet_library/_search?size=10000 should be all you need for that case. To explicitly match all documents, you can use the q parameter: http://localhost:9200/tweet_library/_search?size=10000&q=*:*
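If you take the request-body route, ReplaceText's Replacement Value would contain the query itself; a minimal match-all sketch of that body:

```json
{
  "query": { "match_all": {} },
  "size": 10000
}
```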
06-28-2016
12:50 AM
1 Kudo
As @Pierre Villard mentioned, FetchElasticsearch should not require an incoming connection; this has been captured as part of NIFI-1576. However, to extract all documents from a particular index and (optional) type you'll need the Search API, whereas FetchElasticsearch uses the Get API. To use the Search API, you can use the InvokeHttp processor with your own search query. Please see this related HCC post: https://community.hortonworks.com/questions/41951/how-to-get-all-values-with-expression-language-in.html
06-27-2016
07:35 PM
The FetchElasticsearch processor uses the Get API, which requires a single document identifier and doesn't support regular expressions. As an alternative, you can use InvokeHttp to call the Multi-Get API or the Search API, which give you more control over the retrieval of multiple documents.
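For illustration, the Multi-Get API accepts a list of document IDs in a single request; a sketch sent to a placeholder index and type at http://localhost:9200/tweet_library/tweet/_mget, with invented IDs:

```json
{
  "ids": ["tweet-1", "tweet-2", "tweet-3"]
}
```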