
Solr: How to index rich-text documents put in HDFS?

Contributor

Hi Guys!

I'm working with Solr, specifically on rich-text document indexing.

I'm able to index PDF documents with the DirectoryIngestMapper by specifying an HDFS folder; a MapReduce job then runs the Solr indexing operation.

Now I want my files to be indexed automatically when they are put into HDFS folders, without having to start a YARN job every time, for each file.

How can I do this, or can you point me to the relevant documentation?

When a file with the same name but different content is saved in HDFS, will Solr update the existing index entry or create a new one?

For each indexing operation, will Solr index only new files, or will it re-index all files (and therefore rebuild all the index data)?

Thank you in advance. Regards, Giuseppe

1 ACCEPTED SOLUTION

Explorer

@Giuseppe Maldarizzi

HDF/NiFi can be used for this particular case. You can use the GetHDFS processor to pick up incoming files and index them with the PutSolrContentStream processor. This also gives you more control if you need it, since you can add extra metadata to index along with the files, such as the source.

If you prefer something a bit more custom, you can use HDFS iNotify hooks and listen for incoming files to process. Take a look at this codebase from a while ago that implements a Java daemon as well as a Storm topology for processing documents from HDFS and indexing them into Solr. It should help you get started and give you good direction; see the minimal sketch after the links below.

Java Daemon: https://github.com/acesir/hdfs-daemon

Storm Topology: https://github.com/acesir/hdfs-storm-indexer
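
For a sense of what the iNotify approach looks like, here is a minimal sketch using Hadoop's HdfsAdmin iNotify API. The NameNode URI, the watched folder, and the indexIntoSolr call are placeholder assumptions, and note that reading the iNotify event stream requires HDFS superuser privileges:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class HdfsIndexWatcher {
    public static void main(String[] args) throws Exception {
        // iNotify streams edit-log events from the NameNode; the caller
        // must be an HDFS superuser. Replace the URI with your NameNode.
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"),
                                        new Configuration());
        DFSInotifyEventInputStream events = admin.getInotifyEventStream();

        while (true) {
            EventBatch batch = events.take(); // blocks until events arrive
            for (Event event : batch.getEvents()) {
                // CLOSE fires when a writer finishes a file, i.e. the
                // hdfs dfs -put has completed and the file is readable.
                if (event.getEventType() == Event.EventType.CLOSE) {
                    String path = ((Event.CloseEvent) event).getPath();
                    if (path.startsWith("/landing/docs/")) { // hypothetical watched folder
                        System.out.println("Indexing " + path);
                        // indexIntoSolr(path); // hypothetical: send the file to Solr
                    }
                }
            }
        }
    }
}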


2 REPLIES


+1 on HDF/NiFi. Its graphical canvas for designing flows can make the whole process really easy for you.

For custom solutions, I can think of two high-level patterns:

1) Index the docs as they are pushed to HDFS, or

2) Run a job every so often that looks for new content and then indexes it.

The core logic of indexing the doc will be the same in either case and will make use of the ExtractingRequestHandler; see the minimal SolrJ sketch below.
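
For illustration, here is a minimal SolrJ sketch of that core logic, posting one file through the /update/extract endpoint (the ExtractingRequestHandler, which uses Apache Tika to parse rich formats such as PDF; the handler must be enabled in solrconfig.xml). The Solr URL, collection name, and file path are placeholder assumptions. It also speaks to the earlier questions: reusing the same literal.id (e.g. the HDFS path) makes Solr overwrite the existing document instead of adding a new one, and only the files you explicitly send are (re)indexed:

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractingIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL and collection name.
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build();

        // /update/extract is the ExtractingRequestHandler (Solr Cell);
        // it extracts text and metadata from the uploaded file.
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("report.pdf"), "application/pdf");

        // A stable unique id (e.g. the HDFS path) means a second put of
        // the same file name overwrites the existing document in the index.
        req.setParam("literal.id", "/landing/docs/report.pdf");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}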