How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

New Contributor

I indexed one of my pom.xml files in Solr 5.2 on the HDP 2.3 sandbox. After I installed Doc Crawler and used it to search the pom.xml content, it retrieved the correct XML document; however, Doc Crawler stripped out all the tags associated with the XML file. Is there a configuration or custom parser that I need to reference in order to search and view all of the content, including the XML tags, using Doc Crawler?

Accepted Solution

Re: How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

New Contributor

If you want to keep the XML tags, index the document as raw/plain text instead of using the XML update handlers.

Modify the schema XML file and clean up the fields. Create some literal fields for metadata about the file, and index the entire XML as a multi-valued field. Honestly, this will make the search itself very poor, because the tags and the words will both be tokenized, so add stop words around the tags to improve search.
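
As a rough illustration of the approach above, here is a minimal Python sketch that wraps a whole XML file in a Solr document as plain text and posts it to the JSON update handler. The field names `filename_s` and `raw_xml_txt`, and the collection URL in the usage comment, are assumptions of mine and not from the original setup; they rely on `*_s` and `*_txt` dynamic fields being defined (and stored) in your schema, so adjust them to match your own schema XML.

```python
import json
import urllib.request


def build_raw_xml_doc(doc_id, filename, xml_text):
    """Build a Solr document that carries the XML verbatim, tags included."""
    return {
        "id": doc_id,
        "filename_s": filename,      # hypothetical literal metadata field
        "raw_xml_txt": [xml_text],   # hypothetical multi-valued stored text field
    }


def post_to_solr(docs, update_url):
    """POST a list of documents as JSON to a Solr update URL."""
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Usage (the URL is an assumed example, not the poster's actual collection):
#   with open("pom.xml", encoding="utf-8") as f:
#       doc = build_raw_xml_doc("pom-1", "pom.xml", f.read())
#   post_to_solr([doc], "http://localhost:8983/solr/mycollection/update?commit=true")
```

Because the XML is sent as an ordinary string value, the stored field should come back with its tags intact in the results. Search quality still depends on how the field type tokenizes the markup.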

Replies

Re: How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

@Paul Codding @Piotr Pruski any ideas on this?

Re: How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

New Contributor

I agree regarding the poor search. However, the customer asked how to search based on tags. Many of their XML documents are very complex, so for a demo I converted the XML files to PDF, and everything worked fine. I am not sure that is the best solution, but it is at least one way.