
Using Cloudera Express CDH5 & Search as a job search engine

Hi, 

I want to use CDH5 + Search as a job search engine, similar to indeed.com, careerjet.com, simplyhired.com, etc.

 

I crawl the web pages and upload the full HTML into HDFS so that it can be parsed with Tika & Morphlines and made searchable with Solr through Hue.

 

Now, the question is this: the documents (HTML format) come from a very wide variety of sources (job boards, employers' websites, newspaper ads, forums, etc.), so it is basically impossible to provide a Solr schema.xml for each source.

 

Cloudera CDH 5 + Search seems to me like the perfect solution, as it can handle high volumes (HDFS), event monitoring (Flume), parsing (Tika & Morphlines), and indexing & search (Solr & Hue).

 

However, I am still puzzled about how to present results to the user in the classic format (http://www.indeed.com/jobs?q=java&l=New+York) without creating a separate Solr schema.xml for each source and document format.
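
To make the target concrete, here is a minimal sketch of how a URL in that format could be translated into Solr select parameters. It assumes a single shared index with a catch-all field named `text` and a location field named `location_s`; both field names and the helper `to_solr_params` are hypothetical, while `q`, `fq`, `df`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlsplit, parse_qs


def to_solr_params(search_url, rows=10):
    """Translate a user-facing search URL into Solr select parameters.

    Assumes (hypothetically) that keywords are searched in a catch-all
    field `text` and that the location lives in a field `location_s`.
    """
    query = parse_qs(urlsplit(search_url).query)
    keywords = query.get("q", ["*:*"])[0]   # fall back to match-all
    location = query.get("l", [""])[0]      # "+" is decoded to a space

    params = {
        "q": keywords,
        "df": "text",    # default search field (assumed name)
        "rows": rows,
        "wt": "json",
    }
    if location:
        # Quote the value so a multi-word location is matched as a phrase.
        escaped = location.replace('"', '\\"')
        params["fq"] = 'location_s:"{}"'.format(escaped)
    return params


params = to_solr_params("http://www.indeed.com/jobs?q=java&l=New+York")
```

The resulting dictionary could then be sent to a collection's `/select` handler; the point is only that one query shape can serve every source if all documents share a few common fields.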

 

I was thinking of using only dynamic fields in Solr, so that for each document only the best-matching field for the job title, the best-matching field for the location, and so on are shown.
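
A minimal sketch of that idea, assuming the parsing step yields a flat dictionary of extracted values per page: the candidate keys below are hypothetical, "best match" is reduced to a fixed priority order rather than any real scoring, and the `*_t` / `*_s` names assume the schema declares dynamic fields with those suffixes (as Solr's example schema does).

```python
# Hypothetical candidate keys a parser might emit, in priority order.
TITLE_KEYS = ["job_title", "title", "og:title", "h1"]
LOCATION_KEYS = ["job_location", "location", "city", "addressLocality"]


def first_match(extracted, keys):
    """Return the first non-empty value found under any of the keys."""
    for key in keys:
        value = extracted.get(key)
        if value and value.strip():
            return value.strip()
    return None


def to_solr_doc(doc_id, extracted):
    """Map source-specific fields onto one small set of dynamic fields."""
    doc = {"id": doc_id}
    title = first_match(extracted, TITLE_KEYS)
    location = first_match(extracted, LOCATION_KEYS)
    if title:
        doc["title_t"] = title        # assumed dynamic text field
    if location:
        doc["location_s"] = location  # assumed dynamic string field
    return doc


# Two pages from different sources end up with the same field names.
board = to_solr_doc("1", {"job_title": "Java Developer", "city": "New York"})
forum = to_solr_doc("2", {"h1": " Java Engineer ", "location": "Boston"})
```

With this kind of normalization a single schema could cover all sources, since the per-source differences are absorbed before indexing instead of in schema.xml.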

 

Any help is most welcome!

 

Kind regards,

Christian
