Central Document Repository on HDFS

New Contributor

Hello,

We are going to build a document repository (Word, PDF, Excel, PowerPoint, ...).
Is it a good idea to use HDFS + Solr for such a repository?

 

Key requirements are:
1. Store documents together with some metadata about them
2. Full-text search of documents
3. Search for documents based on their metadata
4. Retrieve documents from the repository
5. In the future, we are going to do Natural Language Processing on Word/PDF documents.
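
To make requirements 1 and 3 more concrete, here is a minimal sketch of the kind of metadata record that could be kept for each stored document and later sent to a search index. Everything in it is an assumption for illustration: the `DocumentRecord` class, the `build_record` helper, the field names, and the `/repository` path layout are hypothetical and do not come from HDFS, Solr, or CDP.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DocumentRecord:
    """Hypothetical per-document record; field names are illustrative only."""
    doc_id: str          # content hash, used as a stable identifier
    storage_path: str    # where the raw file would live in the file store
    file_name: str       # original file name as uploaded
    content_type: str    # MIME type, e.g. application/pdf
    size_bytes: int
    metadata: dict = field(default_factory=dict)  # free-form business metadata


def build_record(file_name, content, content_type,
                 storage_root="/repository", **metadata):
    """Derive an ID and a storage path from the file content."""
    doc_id = hashlib.sha256(content).hexdigest()
    # Shard by the first two hex characters so no single directory grows huge.
    storage_path = f"{storage_root}/{doc_id[:2]}/{doc_id}"
    return DocumentRecord(
        doc_id=doc_id,
        storage_path=storage_path,
        file_name=file_name,
        content_type=content_type,
        size_bytes=len(content),
        metadata=dict(metadata),
    )


record = build_record(
    "report.pdf",
    b"%PDF-1.7 example bytes",
    "application/pdf",
    author="example-author",
    department="finance",
)
# The JSON form is what a metadata search index could ingest.
print(json.dumps(asdict(record), indent=2))
```

Keeping the raw file and its metadata record separate like this means the metadata search (requirement 3) does not have to touch the stored files at all.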

 

Or would other technologies from the Hadoop ecosystem be a better fit, such as Ozone or a database like HBase?
Let's assume that we use CDP Private Cloud.

 

Best regards
Tomek

1 REPLY

New Contributor

Hello,

I would like to refresh this topic.

Do you know whether it is possible to build an efficient document repository on HDFS?

I am concerned about whether storing and retrieving many small files from HDFS will be an effective solution.
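
To put a rough number on the small-files concern, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb that each file and each block costs the NameNode on the order of 150 bytes of heap, and a 128 MiB block size. Both values are assumptions to be checked against the actual cluster, and the function name is hypothetical.

```python
import math

BYTES_PER_OBJECT = 150           # assumed rule-of-thumb heap cost per file or block
BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MiB)


def estimate_namenode_heap_bytes(num_files, avg_file_size_bytes,
                                 block_size=BLOCK_SIZE,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Rough NameNode heap estimate: one file object plus one object per block."""
    blocks_per_file = max(1, math.ceil(avg_file_size_bytes / block_size))
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# 10 million office documents averaging 200 KiB each:
# every file fits in one block, so there are 2 objects per file.
heap = estimate_namenode_heap_bytes(10_000_000, 200 * 1024)
print(f"{heap / 1e9:.1f} GB of NameNode heap")  # prints "3.0 GB of NameNode heap"
```

Under these assumptions the heap cost depends on the number of files, not on how much data they hold, which is why millions of small documents weigh on the NameNode far more than the same volume stored in a few large files.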

Best regards

Tomek 
