Hi, I have a JMS destination that will be populated with different kinds of documents (PDFs, images, MS Office docs). Potentially millions of those documents can be pushed during the day (plus updates). I want to be able to perform real-time search on them. My question is: what is the best technology to achieve this? I know Solr should be used for search and indexing. I guess Storm should be used for processing the data in real time (I have also seen a Solr bolt). What I am not sure about is how Storm can process free-format files, index them, and make them available for search. Can the Solr Storm bolt do this, or do I need to use something else?
It depends on what kind of processing you expect Storm to do. If it's simple transformations, then I would highly recommend that you consider NiFi/HDF instead. NiFi lets you connect to different data sources to ingest the data and apply simple processing/transformations before landing it at your destination. It supports things like type conversions, text parsing, search and replace, encryption, etc. It also has connectors to several data sources/destinations, including JMS and Solr. Furthermore, you can set up your workflow through a simple-to-use UI.
You can read more about it in the below links.
For sample data flows with different transformations take a look below.
Regarding the indexing: once the data/files have landed in Solr, you can either wait for Solr to index them automatically or have NiFi call the Solr API to start the indexing process.
Take a look below for details about Solr indexing.
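To give a rough idea of what "calling the Solr API" for a binary file can look like, here is a minimal sketch in Python. It assumes that Solr's extracting request handler (Solr Cell, served at /update/extract) is enabled on your collection; the base URL, the collection name "docs", the document id, and the helper names build_extract_url and index_file are all made up for illustration. The commitWithin parameter asks Solr to make the document searchable within the given number of milliseconds, which matters for near-real-time search.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, collection, doc_id, commit_within_ms=1000):
    """Build the URL for Solr's extracting request handler (Solr Cell).

    literal.id sets the document's unique key; commitWithin asks Solr to
    commit the document within the given number of milliseconds.
    """
    params = urlencode({"literal.id": doc_id, "commitWithin": commit_within_ms})
    return f"{solr_base.rstrip('/')}/{collection}/update/extract?{params}"


def index_file(solr_base, collection, doc_id, path,
               content_type="application/octet-stream"):
    """POST the raw bytes of one file (PDF, Word doc, ...) to Solr.

    Assumes the extract handler is enabled; Solr then extracts the text
    and metadata on the server side. Returns the HTTP status code.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, collection, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical values; adjust to your own Solr instance and collection.
    print(build_extract_url("http://localhost:8983/solr", "docs", "doc-1"))
```

In a NiFi flow you would normally not write this yourself, since the Solr connector mentioned above sends the content for you; the sketch is only meant to show the kind of request that ends up at Solr.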
As always, if you find this post helpful, don't forget to "Accept" the answer.