Central Document Repository on HDFS

Hello,

We are going to build a document repository (Word, PDF, Excel, pptx, ...).
Is it a good idea to use HDFS + Solr for such a repository?

Key requirements are:
1. Store documents together with some metadata about them
2. Full-text search of documents
3. Search for documents based on their metadata
4. Retrieve documents from the repository
5. In the future, we are going to do Natural Language Processing on Word/PDF documents.

Or would another technology from the Hadoop ecosystem be a better fit, such as Ozone, or a database such as HBase?
Let's assume that we use CDP Private Cloud.

Best regards
Tomek

1 REPLY

Hello,

I would like to refresh this topic.

Do you know whether it is possible to build an efficient document repository on HDFS?

I am concerned about whether storing and retrieving many small files in HDFS will be an effective solution.
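
To make the small-files concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the commonly quoted rule of thumb that every file and every block takes roughly 150 bytes of NameNode heap, and the default 128 MB HDFS block size. The function name and the example numbers (10 million files of about 500 KiB each) are made up for illustration and are not measurements from a real cluster.

```python
import math

# Assumption: commonly quoted approximation, not a measured value.
BYTES_PER_NAMENODE_OBJECT = 150
# Default dfs.blocksize in recent Hadoop versions (128 MB).
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def namenode_heap_bytes(num_files, avg_file_size_bytes):
    """Estimate NameNode heap used by file and block metadata.

    Each file counts as one object, plus one object per block.
    Even a tiny file occupies at least one block entry.
    """
    blocks_per_file = max(1, math.ceil(avg_file_size_bytes / BLOCK_SIZE_BYTES))
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_NAMENODE_OBJECT


if __name__ == "__main__":
    # Hypothetical workload: 10 million documents of about 500 KiB each.
    files = 10_000_000
    heap = namenode_heap_bytes(files, 500 * 1024)
    print(f"{files:,} files: roughly {heap / 1024**3:.1f} GiB of NameNode heap")
```

Under these assumptions the metadata cost depends on the number of files rather than on their total size, which is why I am asking about many small documents in particular.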

Best regards

Tomek