How do I query and view content, including tags, of an indexed XML file in Solr while using Doc Crawler?

Contributor

I indexed one of my pom.xml files in Solr 5.2 on the HDP 2.3 sandbox. After installing Doc Crawler and using it to search the pom.xml content, it successfully retrieved the proper XML document; however, Doc Crawler stripped out all the tags associated with the XML file. Is there a configuration or custom parser that I need to reference in order to search and view all content, including the XML tags, using Doc Crawler?

1 ACCEPTED SOLUTION

Explorer

If you want to keep the XML tags, index the document as raw/plain text instead of sending it through the XML update handlers.

Modify the schema XML file and clean up the fields. Create some literal fields for metadata about the file, and index the entire XML in a multi-valued field. Honestly, this will make search quality poor, because tags and words will all be tokenized; to improve search and optimization, add stop words for the tags.
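
As a rough illustration of the approach above, here is a minimal Python sketch that wraps the raw XML text in a Solr document and posts it to the JSON update endpoint. The collection name `xmldocs`, the host and port, and the field names `filename_s` and `raw_content_txt` are assumptions for illustration only (they rely on `*_s` and `*_txt` dynamic fields being defined in your schema); adjust them to whatever your cleaned-up schema actually uses.

```python
import json
from urllib import request

# Assumed URL, not from the thread: adjust host, port and collection name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/xmldocs/update?commit=true"


def build_raw_xml_doc(doc_id, filename, xml_text):
    """Return a Solr document that carries the XML verbatim, tags included."""
    return {
        "id": doc_id,
        "filename_s": filename,         # literal metadata field (hypothetical name)
        "raw_content_txt": [xml_text],  # multi-valued text field (hypothetical name)
    }


def post_docs(docs, url=SOLR_UPDATE_URL):
    """POST a list of documents as JSON to Solr and return the HTTP status."""
    req = request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status


# Example usage (needs a running Solr instance and a local pom.xml):
#   with open("pom.xml", encoding="utf-8") as fh:
#       post_docs([build_raw_xml_doc("pom-1", "pom.xml", fh.read())])
```

Because the XML is sent as an ordinary string value, a stored field should hand it back exactly as submitted, tags and all; only the indexed tokens are affected by the analyzer.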

3 REPLIES

@Paul Codding @Piotr Pruski any ideas on this?

Contributor

I agree regarding the poor search. However, the customer asked how to search based on tags. A lot of their XML docs are very complex, so for the purposes of a demo I converted the XML files to PDF, and everything worked fine. I am not sure that is the best solution, but it is at least one way.
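
For the tag-based search the customer asked about, one possible sketch is shown below. It assumes the raw XML lives in a tokenized, stored text field (hypothetical name `raw_content_txt`) in a collection called `xmldocs`, and that the field's analyzer uses a standard tokenizer that drops the angle brackets, so a tag such as `<dependency>` is indexed as the token `dependency`. None of these names come from the thread, and whether a bare tag name matches depends on your actual field type.

```python
from urllib.parse import urlencode

# Assumed names, not from the thread: adjust to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/xmldocs/select"
RAW_FIELD = "raw_content_txt"


def tag_query_url(tag_name, rows=10):
    """Build a select URL that looks for a tag name in the raw XML field."""
    params = {
        "q": f"{RAW_FIELD}:{tag_name}",  # match the tag name as a plain token
        "fl": f"id,{RAW_FIELD}",         # return the stored XML, tags included
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# Example usage: open the printed URL in a browser or fetch it with urllib.
#   print(tag_query_url("dependency"))
```

Note that this matches the tag name wherever it appears as a word, so it cannot tell a `<dependency>` element apart from the word "dependency" in a comment or description.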