
How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

adaher — Fri, 16 Oct 2015 19:26:38 GMT

I indexed one of my pom.xml files in Solr 5.2 on the HDP 2.3 sandbox. After I installed Doc Crawler and used it to search the pom.xml content, it successfully retrieved the proper XML document; however, Doc Crawler stripped out all the tags associated with the XML file. Is there a configuration or custom parser that I need to reference in order to search and view all content, including the XML tags, using Doc Crawler?

Re: How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

abajwa — Fri, 16 Oct 2015 23:47:39 GMT

@Paul Codding @Piotr Pruski any ideas on this?

Re: How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

acesir — Mon, 19 Oct 2015 20:39:46 GMT

If you want to keep the XML tags, then you should index the document as raw/plain text instead of going through the XML update handlers.

Modify the schema XML file and clean up the fields. Create some literal fields for metadata about the file, and index the entire XML as a single multivalued field. Honestly, this will make the search itself very poor, because tags and words will both be tokenized, so to improve search quality add stop words around the tags.
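As a rough sketch of the approach above (not a tested Doc Crawler setup), something like the following Python could build the document and send it to Solr's JSON update endpoint. The field names `filename_s` and `raw_xml_txt` are my assumptions: they rely on `*_s` and `*_txt` dynamic fields being defined and stored in your schema, so rename them to match whatever you set up. The update URL you pass in is also yours to supply. Because a stored field returns the original string, the tags should come back intact in the search results.

```python
import json
import urllib.request
from pathlib import Path


def build_raw_xml_doc(path, text):
    """Build one Solr document that keeps the XML markup verbatim."""
    return {
        "id": str(path),                # unique key for the document
        "filename_s": Path(path).name,  # literal metadata field (assumed field name)
        "raw_xml_txt": [text],          # whole file, tags included (assumed multivalued field)
    }


def post_docs(docs, update_url):
    """POST a list of documents as JSON to a Solr update URL; return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

To use it, read pom.xml into a string, pass it to `build_raw_xml_doc`, and call `post_docs` with a one-element list and your core's update URL with `commit=true` appended (for example `http://localhost:8983/solr/rawxml/update?commit=true`, where `rawxml` is a made-up core name).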

Re: How do I query and view content including TAGS of an indexed XML file in Solr while using Doc Crawler

adaher — Fri, 23 Oct 2015 05:16:02 GMT

I agree regarding the poor search. However, the customer asked how to search based on tags. A lot of their XML docs are very complex, so for the purposes of a demo I converted the XML files to PDF, and everything worked fine. I am not sure that is the best solution, but it is at least one way.