What technology shall be used?

Hi, I have a JMS destination that will be populated with different kinds of documents (PDFs, pictures, MS Office docs). Potentially millions of those documents can be pushed during the day (plus updates), and I want to be able to run real-time searches on them. My question is: what is the best technology to achieve this? I know Solr should be used for search and indexing. I guess Storm should be used for processing the data in real time (I have also seen that there is a Solr bolt). What I am not sure about is how Storm can process free-format files, index them, and make them available for search. Can the Solr bolt for Storm do this, or do I need to use something else?


Re: What technology shall be used?

It depends on what kind of processing you expect Storm to do. If it's simple transformations, then I would highly recommend that you consider NiFi/HDF instead. NiFi lets you easily connect to different data sources to ingest the data, and do some simple processing/transformation of the data before landing it at your destination. It supports things like type conversions, text parsing, search and replace, encryption, etc. It also has connectors to several data sources/destinations, including JMS and Solr. Furthermore, you can set up your workflow through a simple-to-use UI.

You can read more about it in the below links.

For sample data flows with different transformations take a look below.

Regarding the indexing: Solr indexes documents as they are sent to it, but they only become visible to searches after a commit. You can either rely on Solr's auto-commit settings or have NiFi call the Solr API to trigger the commit.

Take a look below for details about Solr indexing.
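
To make the indexing step a bit more concrete, here is a minimal sketch of sending one binary file (PDF, Word doc, etc.) to Solr's extracting request handler (`/update/extract`, a.k.a. Solr Cell), which pulls the text out of the file and indexes it. It is only an illustration, not the NiFi flow itself. Assumptions: the handler is enabled in your solrconfig.xml, and the base URL, the collection name `docs`, the document ID, and the helper names `build_extract_url` / `index_file` are all made up for the example. `commitWithin` asks Solr to make the document searchable within the given number of milliseconds.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, collection, doc_id, commit_within_ms=1000):
    """Build the URL for Solr's extracting request handler.

    solr_base and collection are placeholders for your own setup.
    """
    params = {
        "literal.id": doc_id,              # value for the document's unique key
        "commitWithin": commit_within_ms,  # searchable within this many ms
    }
    base = solr_base.rstrip("/")
    return f"{base}/{collection}/update/extract?{urlencode(params)}"


def index_file(solr_base, collection, doc_id, path,
               content_type="application/octet-stream"):
    """POST the raw file bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, collection, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request, timeout=30) as response:
        return response.status


# Example call (needs a running Solr with the extraction handler enabled):
# index_file("http://localhost:8983/solr", "docs", "doc-1", "report.pdf",
#            content_type="application/pdf")
```

For millions of documents a day you would normally not commit per document; a `commitWithin` value or Solr's auto-commit settings let Solr batch the commits for you.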

As always, if you find this post helpful, don't forget to "Accept" the answer
