I am investigating how to index and search a huge number of PDF documents using the Hadoop technology stack.
My data consists of two parts: 1) the raw PDF documents, and 2) field data about the PDF documents, which has already been extracted by external applications. Solr looks like a good tool for indexing the PDF documents based on the field data (part 2), but where should I store the raw PDF documents (part 1)?
My initial plan is to store the PDF documents in HDFS and add the "HDFS path" to the field data when building the index with Solr. However, some websites mention that HDFS is not good at storing a huge number of small files. Can someone give some suggestions for my scenario? Should I store the PDF documents in HBase, or use another document-oriented database like MongoDB?
1) store PDF files in HDFS
It would be possible to store the individual PDF files in HDFS and keep the HDFS path as an additional stored field in the Solr index. What you need to consider is that HDFS is best at storing a small number of very large files, because the NameNode keeps the metadata for every file and block in memory. Storing a large number of relatively small PDF files in HDFS is therefore inefficient.
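To get a feel for the small-files cost, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a 128 MiB block size; both numbers are assumptions for illustration, not measurements from your cluster, and the function name is made up.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object (file or block).
# This is a widely quoted rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150

# Assumption: 128 MiB HDFS block size; adjust to your cluster's setting.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def namenode_heap_bytes(num_files, avg_file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Rough NameNode heap needed to track num_files files of avg_file_size bytes."""
    # Every file occupies at least one block; larger files need ceil(size / block).
    blocks_per_file = max(1, -(-avg_file_size // block_size))
    # One object for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# 10 million PDFs of 1 MiB each, stored as individual files.
small_files = namenode_heap_bytes(10_000_000, 1 * 1024 * 1024)

# Roughly the same volume of data packed into 1 GiB container files.
packed_files = namenode_heap_bytes(9_766, 1024 * 1024 * 1024)

print(f"individual PDFs: {small_files / 1e9:.1f} GB of NameNode heap")
print(f"packed files:    {packed_files / 1e6:.1f} MB of NameNode heap")
```

Under these assumptions the metadata cost is driven by the number of files rather than by the amount of data, which is why the number and average size of your PDFs matter so much for this option.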
2) store PDF files in HBase
It would also be possible to store the PDF files in a key-value style store such as HBase. This option is definitely feasible, and I have seen several real-life implementations of this design. In this case, you would store the HBase row key in the Solr index.
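A minimal sketch of this design might look like the following. It assumes a table object with a `put(row, data)` method in the style of the happybase client; the `pdf:content` column, the `hbase_row_key` field name, and the SHA-1 based row key are all hypothetical choices for illustration. `FakeTable` only stands in for a real HBase table so the example runs without a cluster.

```python
import hashlib


class FakeTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self):
        self.rows = {}

    def put(self, row, data):
        # Merge the given columns into the row, like a simple HBase put.
        self.rows.setdefault(row, {}).update(data)


def store_pdf(table, pdf_bytes, fields):
    """Write the PDF bytes to the table and return the matching Solr document."""
    # Hypothetical row key scheme: hex SHA-1 digest of the file content.
    row_key = hashlib.sha1(pdf_bytes).hexdigest()
    # Hypothetical column family "pdf" with qualifier "content".
    table.put(row_key.encode("ascii"), {b"pdf:content": pdf_bytes})
    # The Solr document carries the extracted fields plus a pointer into HBase.
    doc = dict(fields)
    doc["hbase_row_key"] = row_key
    return doc


table = FakeTable()
pdf = b"%PDF-1.4 demo content"
solr_doc = store_pdf(table, pdf, {"id": "doc-1", "title": "Demo report"})
print(solr_doc)
```

At query time you would search Solr on the extracted fields, read `hbase_row_key` from the matching document, and fetch the PDF bytes from HBase with that key.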
3) store PDF files in the Solr index itself
I think it is also possible to store the original PDF file in the Solr index itself. You would use the BinaryField type and set the stored property to true. (Note that you could accomplish the same with an older version of Solr that lacks the BinaryField type. In that case, you would convert the PDF to text (e.g. with base64 encoding) and store the text value in a stored=true field. Upon retrieval, you would convert it back to PDF.)
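The base64 round trip mentioned above can be sketched like this. The helper names are made up for the example, and the sample bytes are just a stand-in for a real PDF; only the standard library `base64` module is used.

```python
import base64


def pdf_to_text_field(pdf_bytes):
    """Encode raw PDF bytes as ASCII text suitable for a stored text field."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def text_field_to_pdf(value):
    """Decode the stored text value back into the original PDF bytes."""
    return base64.b64decode(value)


# Stand-in for the content of a real PDF file, including non-text bytes.
original = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

encoded = pdf_to_text_field(original)
restored = text_field_to_pdf(encoded)

print(len(original), "bytes ->", len(encoded), "characters")
```

Keep in mind that base64 produces 4 characters for every 3 input bytes, so the stored value is about a third larger than the PDF itself.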
Without an estimate of the number of PDF files and the average size of a PDF, it is hard to choose the best design. Another important factor is whether you want to update your documents frequently, or just add them to the index once and never change them afterwards.