Member since: 09-29-2015
Posts: 871
Kudos Received: 723
Solutions: 255
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3349 | 12-03-2018 02:26 PM
 | 2302 | 10-16-2018 01:37 PM
 | 3619 | 10-03-2018 06:34 PM
 | 2395 | 09-05-2018 07:44 PM
 | 1814 | 09-05-2018 07:31 PM
07-11-2016
02:52 PM
1 Kudo
If you use NiFi you can use the ListHDFS + FetchHDFS processors to monitor an HDFS directory for new files. From there you have two options for indexing the documents:
1) As Sunile mentioned, you could write a processor that extracts the information using Tika and then send the results to the PutSolrContentStream processor. There will be a new ExtractMediaMetadata processor in the next release, but it doesn't extract the body content, so you would likely need to implement your own processor.
2) You could send the documents (PDFs, emails, Word files) straight from FetchHDFS to PutSolrContentStream, and configure PutSolrContentStream to use Solr's extracting request handler, which uses Tika behind the scenes: https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache.html
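If you go with option 1, the core of the extraction is only a few lines of Tika. Here is a minimal sketch, assuming the tika-core and tika-parsers jars are on the classpath and reading from a local file for simplicity (inside a custom processor you would read from the FlowFile's InputStream instead); the class name is just illustrative:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Minimal sketch: detect the type of a document and extract its body text with Tika.
// A custom NiFi processor would run this against the FlowFile's InputStream instead
// of a local file and then route the extracted text to PutSolrContentStream.
public class TikaExtractSketch {
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();

            parser.parse(in, handler, metadata);

            System.out.println("Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
            System.out.println(handler.toString()); // the body text you would index in Solr
        }
    }
}
```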
07-07-2016
08:27 PM
It is really interesting that NiFi is showing up in that list. I assumed you had added the unofficial NiFi service definition; I had no idea it was in the sandbox by default. I would be curious to know where that is coming from.
07-07-2016
08:16 PM
2 Kudos
There is no official NiFi service for Ambari yet. The recommended approach is to download the NiFi tar/zip onto the sandbox, extract it, change the web port (the nifi.web.http.port property in conf/nifi.properties) to something other than 8080, and forward that port through the VM.
07-06-2016
04:03 PM
Could you provide a screenshot of your NiFi flow showing the output port?
07-06-2016
01:55 AM
The error you provided says that the AmbariReportingTask you have running cannot connect to the Ambari Metrics Collector. This error will not prevent anything in your flow from working and should have no effect on getting a Twitter example working; it only stops metrics from being sent to Ambari. Please verify that the Ambari Metrics Collector is running at the host and port specified in the reporting task's Metrics Collector URL property; the default value is http://localhost:6188/ws/v1/timeline/metrics
07-05-2016
04:03 PM
I personally haven't used the PutCassandraQL processor so @Matt Burgess may know more than me, but I think it expects the content of the FlowFile to contain a CQL statement that uses ? placeholders for the parameters. So let's say your insert statement is the one below; you could use a ReplaceText processor and set its Replacement Value property to something like this:
INSERT INTO mytable (field1, field2) VALUES (?, ?)
Before getting to that processor you would need to set up the following FlowFile attributes:
cql.args.1.value = value1
cql.args.1.type = string
cql.args.2.value = value2
cql.args.2.type = string
The value attributes could come from EvaluateJsonPath extracting the appropriate values, and the type attributes could be added by UpdateAttribute. So your full flow might be: GetKafka -> EvaluateJsonPath -> RouteOnAttribute -> UpdateAttribute -> ReplaceText -> PutCassandraQL. There is an example NiFi template for working with Cassandra here: https://github.com/hortonworks-gallery/nifi-templates/blob/master/templates/CassandraProcessors.xml I have not used it before so I can't say exactly what it demonstrates, but it might be helpful to look at.
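If it helps to see what the ? placeholders and the typed cql.args attributes turn into, here is a rough sketch of a parameterized insert with the DataStax Java driver, which I believe is what PutCassandraQL is built on; the contact point, keyspace, table, and values are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Rough sketch of a parameterized CQL insert with the DataStax Java driver.
// The contact point, keyspace, table, and values are placeholders.
public class CqlBindSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            // Same statement ReplaceText would put into the FlowFile content
            PreparedStatement ps = session.prepare(
                    "INSERT INTO mytable (field1, field2) VALUES (?, ?)");
            // The cql.args.1.* and cql.args.2.* attributes correspond to these bind values
            session.execute(ps.bind("value1", "value2"));
        }
    }
}
```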
07-05-2016
02:21 PM
2 Kudos
Both approaches could work. This type of filtering task is well suited to NiFi: you could likely use the EvaluateJsonPath and RouteOnAttribute processors to perform the filtering and the PutCassandraQL processor to insert into Cassandra, without having to write any code.
06-28-2016
10:21 PM
6 Kudos
The PutSolrContentStream processor in Apache NiFi makes use of Solr's ContentStreamUpdateRequest, which means it can stream arbitrary data to Solr. Typically this processor is used to insert JSON documents, but it can be used to stream any kind of data. The following tutorial shows how to use NiFi to stream data to Solr's Extracting Request Handler.

Setup Solr
- Download the latest version of Solr (6.0.0 at the time of writing) and extract the distribution
- Start Solr with the cloud example: ./bin/solr start -e cloud -noprompt
- Verify Solr is running by going to http://localhost:8983/solr in your browser

Setup NiFi
- Download the latest version of NiFi (0.6.1 at the time of writing) and extract the distribution
- Start NiFi: ./bin/nifi.sh start
- Verify NiFi is running by going to http://localhost:8080/nifi in your browser
- Create a directory under the NiFi home for listening for new files:
cd nifi-0.6.1
mkdir data
mkdir data/input

Create the NiFi Flow
Create a simple flow of GetFile -> PutSolrContentStream -> LogAttribute. The GetFile Input Directory should be ./data/input, corresponding to the directory created earlier. The configuration for PutSolrContentStream should be the following:
- The Solr Type is set to Cloud since we started the cloud example
- The Solr Location is the ZooKeeper connection string for the embedded ZK started by Solr
- The Collection is the example gettingstarted collection created by Solr
- The Content Stream Path is the path of the update handler in Solr used for extracting text; this corresponds to a path in solrconfig.xml
- The Content-Type is application/octet-stream so we can stream any arbitrary data
The extracting request handler is described in detail here: https://wiki.apache.org/solr/ExtractingRequestHandler We can see that a parameter called "literal.id" is normally passed on the URL. Any user-defined properties on PutSolrContentStream will be passed as URL parameters to Solr, so by clicking the + icon in the top-right we can add this property and set it to the UUID of the flow file.

Ingest & Query
At this point we can copy any document into <nifi_home>/data/input and see if Solr can identify it. For this example I copied the quickstart.html file from the Solr docs directory. After going to the Solr Admin UI and querying the "gettingstarted" collection for all documents, you should see that Solr identified the document as "text/html", extracted the title as "Solr Quick Start", and used the UUID of the FlowFile from NiFi as the id. We can also see the extraction was done using Tika behind the scenes. From here you can send in any type of document, PDF, Word, Excel, etc., and have Solr extract the text using Tika.
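For reference, here is a sketch of the same extracting request issued directly with SolrJ's ContentStreamUpdateRequest, which is essentially what PutSolrContentStream is doing for us; the base URL, file name, and id value are just examples:

```java
import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Sketch: stream a document to Solr's extracting request handler with SolrJ.
// The base URL, file, and id value are illustrative.
public class ExtractingHandlerSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/gettingstarted").build()) {
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract"); // the Content Stream Path
            req.addFile(new File("quickstart.html"), "application/octet-stream");
            req.setParam("literal.id", "doc-1"); // same idea as the literal.id property in the flow
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            solr.request(req);
        }
    }
}
```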
06-28-2016
05:56 PM
7 Kudos
NiFi is not built on top of Hadoop and therefore does not use MapReduce or any other processing platform. NiFi is a dataflow tool for moving data between systems and performing simple event processing, routing, and transformations. Each node in a NiFi cluster runs the same flow, and it is up to the designer of the flow to partition the data across the NiFi cluster. This presentation shows strategies for dividing data across your cluster: http://www.slideshare.net/BryanBende/data-distribution-patterns-with-apache-nifi This presentation has an architecture diagram of what a cluster looks like with the internal repositories (slide 17): http://www.slideshare.net/BryanBende/nj-hadoop-meetup-apache-nifi-deep-dive
06-28-2016
03:10 PM
4 Kudos
There is a GetHBase processor that is designed to incrementally extract data from an HBase table by keeping track of the last timestamp it has seen and finding cells whose timestamp is greater than that. There is an open JIRA to create another processor, possibly called ScanHBase, that would not be based on timestamps and would allow more general extraction.
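To illustrate the idea, here is a minimal sketch of a timestamp-bounded scan with the standard HBase client API, which is roughly the pattern GetHBase relies on; the table name and the way the last timestamp is persisted are placeholders:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Sketch of the incremental pattern behind GetHBase: only scan cells whose
// timestamp is newer than the last timestamp processed. "mytable" is a placeholder,
// and a real implementation would persist lastSeenTimestamp between runs.
public class IncrementalScanSketch {
    public static void main(String[] args) throws IOException {
        long lastSeenTimestamp = Long.parseLong(args[0]);
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {

            Scan scan = new Scan();
            // setTimeRange is [min, max), so min = lastSeen + 1 returns only newer cells
            scan.setTimeRange(lastSeenTimestamp + 1, Long.MAX_VALUE);

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(result); // GetHBase would turn each row into a FlowFile
                    for (Cell cell : result.rawCells()) {
                        lastSeenTimestamp = Math.max(lastSeenTimestamp, cell.getTimestamp());
                    }
                }
            }
        }
    }
}
```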