I have got millions of small xml's (in KBs) stored in zipped file at S3. I need to create a solution to provide a search capability over it so that if a request comes up for a certain tag text then all xmls(in whole) containing that text should be returned.
I am thinking to create a sequence file for the same to avoid small files problem which will be indexed using Solr.
1. I read there is "kite-morphlines-hadoop-sequencefile" morphline command that can be used to read and index it into Solr. Can anyone share any working example of the same ?
2. Also, can anyone share any insight on what should I do to send the whole xml/xmls based on search criteria back to the requestor given the fact that it will be in a sequence file.