Member since: 09-29-2015
Posts: 871
Kudos Received: 723
Solutions: 255
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3349 | 12-03-2018 02:26 PM
 | 2302 | 10-16-2018 01:37 PM
 | 3619 | 10-03-2018 06:34 PM
 | 2395 | 09-05-2018 07:44 PM
 | 1814 | 09-05-2018 07:31 PM
07-11-2016
02:52 PM
1 Kudo
If you use NiFi you can use the ListHDFS + FetchHDFS processors to monitor an HDFS directory for new files. From there you have two options for indexing the documents:
1) As Sunile mentioned, you could write a processor that extracts the information using Tika and then send the results to the PutSolrContentStream processor. There will be a new ExtractMediaMetadata processor in the next release, but it doesn't extract the body content, so you would likely need to implement your own processor.
2) You could send the documents (PDFs, emails, Word files) straight from FetchHDFS to PutSolrContentStream, and configure PutSolrContentStream to use Solr's extracting request handler, which uses Tika behind the scenes: https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache.html
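If you go with option 1, the core of the extraction is only a few lines of Tika. Here is a minimal sketch, assuming the tika-core and tika-parsers jars are on the classpath and reading from a local file for simplicity (inside a custom processor you would read from the FlowFile's InputStream instead); the class name is just illustrative:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Minimal sketch: detect the type of a document and extract its body text with Tika.
// A custom NiFi processor would run this against the FlowFile's InputStream instead
// of a local file and then route the extracted text to PutSolrContentStream.
public class TikaExtractSketch {
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();

            parser.parse(in, handler, metadata);

            System.out.println("Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
            System.out.println(handler.toString()); // the body text you would index in Solr
        }
    }
}
```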
07-07-2016
08:27 PM
It is really interesting that NiFi is showing up in that list. I assumed you had added the unofficial NiFi service definition; I had no idea it was in the sandbox by default. I would be curious to know where that is coming from.
07-07-2016
08:16 PM
2 Kudos
There is no official NiFi service for Ambari yet. The recommended approach is to download the NiFi tar/zip onto the sandbox, extract it, change the web port (the nifi.web.http.port property in conf/nifi.properties) to something other than 8080, and forward that port through the VM.
07-06-2016
04:03 PM
Could you provide a screenshot of your NiFi flow showing the output port?
07-06-2016
01:55 AM
The error you provided says that the AmbariReportingTask you have running cannot connect to the Ambari Metrics Collector. This error will not prevent anything in your flow from working and should have no effect on getting a Twitter example working; it only stops metrics from being sent to Ambari. Please verify that the Ambari Metrics Collector is running at the host and port specified in the reporting task's Metrics Collector URL property; the default value is http://localhost:6188/ws/v1/timeline/metrics
07-05-2016
04:03 PM
I personally haven't used the PutCassandraQL processor so @Matt Burgess may know more than me, but I think it expects the content of the FlowFile to contain a CQL statement that uses ? placeholders for the parameters. So let's say your insert statement is the one below; you could use a ReplaceText processor and set its Replacement Value property to something like this:
INSERT INTO mytable (field1, field2) VALUES (?, ?)
Before getting to that processor you would need to set up the following FlowFile attributes:
cql.args.1.value = value1
cql.args.1.type = string
cql.args.2.value = value2
cql.args.2.type = string
The value attributes could come from EvaluateJsonPath extracting the appropriate values, and the type attributes could be added by UpdateAttribute. So your full flow might be: GetKafka -> EvaluateJsonPath -> RouteOnAttribute -> UpdateAttribute -> ReplaceText -> PutCassandraQL. There is an example NiFi template for working with Cassandra here: https://github.com/hortonworks-gallery/nifi-templates/blob/master/templates/CassandraProcessors.xml I have not used it before so I can't say exactly what it demonstrates, but it might be helpful to look at.
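If it helps to see what the ? placeholders and the typed cql.args attributes turn into, here is a rough sketch of a parameterized insert with the DataStax Java driver, which I believe is what PutCassandraQL is built on; the contact point, keyspace, table, and values are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Rough sketch of a parameterized CQL insert with the DataStax Java driver.
// The contact point, keyspace, table, and values are placeholders.
public class CqlBindSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            // Same statement ReplaceText would put into the FlowFile content
            PreparedStatement ps = session.prepare(
                    "INSERT INTO mytable (field1, field2) VALUES (?, ?)");
            // The cql.args.1.* and cql.args.2.* attributes correspond to these bind values
            session.execute(ps.bind("value1", "value2"));
        }
    }
}
```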
07-05-2016
02:21 PM
2 Kudos
Both approaches could work. This type of filtering task is well suited to NiFi: you could likely use the EvaluateJsonPath and RouteOnAttribute processors to perform the filtering and the PutCassandraQL processor to insert into Cassandra, without having to write any code.
06-28-2016
10:21 PM
6 Kudos
The PutSolrContentStream processor in Apache NiFi makes use of Solr's ContentStreamUpdateRequest, which means it can stream arbitrary data to Solr. Typically this processor is used to insert JSON documents, but it can be used to stream any kind of data. The following tutorial shows how to use NiFi to stream data to Solr's Extracting Request Handler.

Setup Solr
- Download the latest version of Solr (6.0.0 at the time of writing) and extract the distribution
- Start Solr with the cloud example: ./bin/solr start -e cloud -noprompt
- Verify Solr is running by going to http://localhost:8983/solr in your browser

Setup NiFi
- Download the latest version of NiFi (0.6.1 at the time of writing) and extract the distribution
- Start NiFi: ./bin/nifi.sh start
- Verify NiFi is running by going to http://localhost:8080/nifi in your browser
- Create a directory under the NiFi home for listening for new files:
cd nifi-0.6.1
mkdir data
mkdir data/input

Create the NiFi Flow
Create a simple flow of GetFile -> PutSolrContentStream -> LogAttribute. The GetFile Input Directory should be ./data/input, corresponding to the directory created earlier. The configuration for PutSolrContentStream should be the following:
- The Solr Type is set to Cloud since we started the cloud example
- The Solr Location is the ZooKeeper connection string for the embedded ZK started by Solr
- The Collection is the example gettingstarted collection created by Solr
- The Content Stream Path is the path of the update handler in Solr used for extracting text; this corresponds to a path in solrconfig.xml
- The Content-Type is application/octet-stream so we can stream any arbitrary data
The extracting request handler is described in detail here: https://wiki.apache.org/solr/ExtractingRequestHandler We can see that a parameter called "literal.id" is normally passed on the URL. Any user-defined properties on PutSolrContentStream will be passed as URL parameters to Solr, so by clicking the + icon in the top-right we can add this property and set it to the UUID of the flow file.

Ingest & Query
At this point we can copy any document into <nifi_home>/data/input and see if Solr can identify it. For this example I copied the quickstart.html file from the Solr docs directory. After going to the Solr Admin UI and querying the "gettingstarted" collection for all documents, you should see that Solr identified the document as "text/html", extracted the title as "Solr Quick Start", and used the UUID of the FlowFile from NiFi as the id. We can also see the extraction was done using Tika behind the scenes. From here you can send in any type of document, PDF, Word, Excel, etc., and have Solr extract the text using Tika.
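For reference, here is a sketch of the same extracting request issued directly with SolrJ's ContentStreamUpdateRequest, which is essentially what PutSolrContentStream is doing for us; the base URL, file name, and id value are just examples:

```java
import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Sketch: stream a document to Solr's extracting request handler with SolrJ.
// The base URL, file, and id value are illustrative.
public class ExtractingHandlerSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/gettingstarted").build()) {
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract"); // the Content Stream Path
            req.addFile(new File("quickstart.html"), "application/octet-stream");
            req.setParam("literal.id", "doc-1"); // same idea as the literal.id property in the flow
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            solr.request(req);
        }
    }
}
```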
06-28-2016
05:56 PM
7 Kudos
NiFi is not built on top of Hadoop and therefore does not use MapReduce or any other processing platform. NiFi is a dataflow tool for moving data between systems and performing simple event processing, routing, and transformations. Each node in a NiFi cluster runs the same flow, and it is up to the designer of the flow to partition the data across the NiFi cluster. This presentation shows strategies for dividing data across your cluster: http://www.slideshare.net/BryanBende/data-distribution-patterns-with-apache-nifi This presentation has an architecture diagram of what a cluster looks like with the internal repositories (slide 17): http://www.slideshare.net/BryanBende/nj-hadoop-meetup-apache-nifi-deep-dive
06-28-2016
03:10 PM
4 Kudos
There is a GetHBase processor that is designed to incrementally extract data from an HBase table by keeping track of the last timestamp it has seen and finding cells whose timestamp is greater than that. There is an open JIRA to create another processor, possibly called ScanHBase, that would not be based on timestamps and would allow more general extraction.
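To illustrate the idea, here is a minimal sketch of a timestamp-bounded scan with the standard HBase client API, which is roughly the pattern GetHBase relies on; the table name and the way the last timestamp is persisted are placeholders:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Sketch of the incremental pattern behind GetHBase: only scan cells whose
// timestamp is newer than the last timestamp processed. "mytable" is a placeholder,
// and a real implementation would persist lastSeenTimestamp between runs.
public class IncrementalScanSketch {
    public static void main(String[] args) throws IOException {
        long lastSeenTimestamp = Long.parseLong(args[0]);
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {

            Scan scan = new Scan();
            // setTimeRange is [min, max), so min = lastSeen + 1 returns only newer cells
            scan.setTimeRange(lastSeenTimestamp + 1, Long.MAX_VALUE);

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(result); // GetHBase would turn each row into a FlowFile
                    for (Cell cell : result.rawCells()) {
                        lastSeenTimestamp = Math.max(lastSeenTimestamp, cell.getTimestamp());
                    }
                }
            }
        }
    }
}
```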