Created 10-16-2015 12:26 PM
I indexed one of my pom.xml files in Solr 5.2 on the HDP 2.3 sandbox. After installing Doc Crawler and using it to search the pom.xml content, it successfully retrieved the correct XML document. However, Doc Crawler stripped out all the tags associated with the XML file. Is there a configuration or custom parser that needs to be referenced in order to search and view all content, including the XML tags, using Doc Crawler?
Created 10-19-2015 01:39 PM
If you want to keep the XML tags, you should index the document without using the XML update handlers and instead index everything as raw/plain text.
Modify the schema XML file and clean up the fields. Create some literal fields for metadata about the file, and index the entire XML as a multi-valued field. Honestly, this will make the search itself very poor, because tags and words will both be tokenized. To improve search quality, add stop words around the tags.
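As a rough illustration of the approach above, here is a minimal Python sketch that wraps the whole XML file as an opaque string in a JSON document and prepares a POST to Solr's JSON update handler. The core name `mycore` and the field names `filename_s` and `raw_xml_txt` are assumptions for the example, not anything Doc Crawler or the sandbox defines; they need to match whatever you declare in your own schema.

```python
import json
import urllib.request

# Hypothetical URL: replace host, port, and core name with your own.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def build_raw_xml_doc(doc_id, filename, xml_text):
    """Build a Solr document that carries the XML, tags included, as plain text."""
    return {
        "id": doc_id,
        "filename_s": filename,      # literal metadata field (assumed name)
        "raw_xml_txt": [xml_text],   # multi-valued text field (assumed name)
    }


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Prepare, but do not send, a JSON POST to the Solr update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    pom = "<project><artifactId>demo</artifactId></project>"
    req = build_update_request([build_raw_xml_doc("pom-1", "pom.xml", pom)])
    # Sending requires a running Solr instance:
    # urllib.request.urlopen(req)
    print(req.data.decode("utf-8"))
```

Because the XML travels as a JSON string value rather than through the XML update handler, the tags are not interpreted as markup. Whether they survive in search results still depends on the field being stored and on the analyzer configured for it.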
Created 10-16-2015 04:47 PM
@Paul Codding @Piotr Pruski any ideas on this?
Created 10-22-2015 10:16 PM
I agree regarding the poor search. However, the customer asked how to search based on tags. Many of their XML documents are very complex, so for the purposes of a demo I converted the XML files to PDF, and everything worked fine. I am not sure that is the best solution, but it is at least one way.