I want to use CDH5 + Search to build a job search engine, similar to indeed.com, careerjet.com, simplyhired.com, etc.
I crawl the web pages and upload the full HTML into HDFS so that it can be parsed with Tika & Morphlines and made searchable with Solr through Hue.
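To make the parsing step more concrete, here is a minimal sketch of the kind of record I have in mind after parsing. It is not Tika or Morphlines, just plain Python with the standard library, and the names (`PageTextExtractor`, `html_to_record`, and the `id` / `title` / `text` fields) are hypothetical, chosen only for illustration. The idea is that every page, whatever its source, is reduced to the same few generic fields.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    # Content of these tags is never shown to the user, so it is skipped.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        else:
            self.text_parts.append(text)


def html_to_record(url, html):
    """Reduce one crawled page to a generic record (hypothetical field names)."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,  # assumption: the crawl URL serves as the unique key
        "title": " ".join(parser.title_parts),
        "text": " ".join(parser.text_parts),
    }


if __name__ == "__main__":
    page = "<html><head><title>Java Developer</title></head><body><p>Apply now</p></body></html>"
    print(html_to_record("http://example.com/job/1", page))
```

In the real pipeline this work would be done by Tika and a morphline rather than by hand-written code; the sketch only shows the shape of the output.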
Now, the question is: the documents (in HTML format) come from a very wide variety of sources (job boards, employers' websites, newspaper ads, forums, etc.), so it is basically impossible to provide a Solr schema.xml for each source.
Cloudera CDH 5+ Search seems to me like the perfect solution, as it can handle high data volumes (HDFS), event monitoring (Flume), parsing (Tika & Morphlines), and indexing and search (Solr & Hue).