How to Index PDF Files with Flume and MorphlineSolrSink

The flow is as follows:

Spooling Directory Source > File Channel > MorphlineSolrSink

The reason I wanted to complete this exercise was to provide a less complex solution: fewer moving parts, less configuration, and no coding compared to Kafka/Storm or Spark. The example is also easy to set up and demonstrate quickly.

Compared to Kafka/Storm, Flume is limited by its declarative nature, but that is also what makes it easy to use. However, the morphline does provide a java command (with some potential performance side effects), so you can get pretty explicit when you need to.
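
For example, the java command lets you embed inline Java in the pipeline. Here is a minimal sketch adapted from the Kite Morphlines documentation; the tags field is purely illustrative and not part of this article's setup:

{
  java {
    imports : "import java.util.*;"
    code : """
      // Drop records that have no "tags" values; pass the rest downstream.
      List tags = record.get("tags");
      if (tags.isEmpty()) {
        return false;
      }
      return child.process(record);
    """
  }
}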

I’ve read that Flume can handle around 50,000 events per second on a single server, so while the pipe may not be as fat as a Kafka/Storm pipe, it may be well suited for many use cases.

Step-by-step guide

1. Take care of dependencies. I am running the HDP 2.2.4 Sandbox and the Solr that came with it. To get started, you will need to add a lot of dependencies to /usr/hdp/current/flume-server/lib/. You can get all of them from the /opt/solr/solr/contrib/ and /opt/solr/solr/dist/ directory structures (a copy-command sketch follows the jar list below).

commons-fileupload-1.2.1.jar
config-1.0.2.jar
fontbox-1.8.4.jar
httpmime-4.3.1.jar
kite-morphlines-avro-0.12.1.jar
kite-morphlines-core-0.12.1.jar
kite-morphlines-json-0.12.1.jar
kite-morphlines-tika-core-0.12.1.jar
kite-morphlines-tika-decompress-0.12.1.jar
kite-morphlines-twitter-0.12.1.jar
lucene-analyzers-common-4.10.4.jar
lucene-analyzers-kuromoji-4.10.4.jar
lucene-analyzers-phonetic-4.10.4.jar
lucene-core-4.10.4.jar
lucene-queries-4.10.4.jar
lucene-spatial-4.10.4.jar
metrics-core-3.0.1.jar
metrics-healthchecks-3.0.1.jar
noggit-0.5.jar
org.restlet-2.1.1.jar
org.restlet.ext.servlet-2.1.1.jar
pdfbox-1.8.4.jar
solr-analysis-extras-4.10.4.jar
solr-cell-4.10.4.jar
solr-clustering-4.10.4.jar
solr-core-4.10.4.jar
solr-dataimporthandler-4.10.4.jar
solr-dataimporthandler-extras-4.10.4.jar
solr-langid-4.10.4.jar
solr-map-reduce-4.10.4.jar
solr-morphlines-cell-4.10.4.jar
solr-morphlines-core-4.10.4.jar
solr-solrj-4.10.4.jar
solr-test-framework-4.10.4.jar
solr-uima-4.10.4.jar
solr-velocity-4.10.4.jar
spatial4j-0.4.1.jar
tika-core-1.5.jar
tika-parsers-1.5.jar
tika-xmp-1.5.jar
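
To save some typing, something like the following copies every jar from those two trees into Flume's lib directory. This is a sketch assuming the Sandbox paths above; it copies more jars than the list strictly requires, so copy individually if you want to keep the classpath minimal:

find /opt/solr/solr/dist /opt/solr/solr/contrib -name '*.jar' \
  -exec cp -v {} /usr/hdp/current/flume-server/lib/ \;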

2. Configure SOLR. There are a few important SOLR configuration files:

  • solr.xml – The solr.xml included with collection1 was unmodified.
  • schema.xml – The schema.xml included with collection1 is all you need. It includes the fields that SolrCell will return when processing the PDF file. Make sure you capture the fields you want with the solrCell command in the morphline.conf file.
  • solrconfig.xml – The solrconfig.xml included with collection1 is all you need. It includes the ExtractingRequestHandler that you need to process the PDF file (shown below for reference).
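
For reference, the stock solrconfig.xml registers the handler roughly like this. This is an abridged copy of the standard Solr 4.x example config; your defaults may differ:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>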

3. Flume Configuration

#agent config
agent1.sources = spooling_dir_src
agent1.sinks = solr_sink
agent1.channels = fileChannel

# Use a file channel
agent1.channels.fileChannel.type = file
#agent1.channels.fileChannel.capacity = 10000
#agent1.channels.fileChannel.transactionCapacity = 10000

# Configure source
agent1.sources.spooling_dir_src.channels = fileChannel
agent1.sources.spooling_dir_src.type = spooldir
agent1.sources.spooling_dir_src.spoolDir = /home/flume/dropzone
agent1.sources.spooling_dir_src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

# Configure Solr sink
agent1.sinks.solr_sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr_sink.morphlineFile = /home/flume/morphline.conf
agent1.sinks.solr_sink.batchSize = 1000
agent1.sinks.solr_sink.batchDurationMillis = 2500
agent1.sinks.solr_sink.channel = fileChannel
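
Before starting the agent, make sure the spooling directory exists and is writable by the user the Flume agent runs as (assumed here to be flume):

mkdir -p /home/flume/dropzone
chown flume:flume /home/flume/dropzone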

4. Morphline Configuration File

solrLocator: {
  collection : collection1
  #zkHost : "127.0.0.1:9983"
  zkHost : "127.0.0.1:2181"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { detectMimeType { includeDefaultMimeTypes : true } }
      {
        solrCell {
          solrLocator : ${solrLocator}
          captureAttr : true
          lowernames : true
          capture : [title, author, content, content_type]
          parsers : [ { parser : org.apache.tika.parser.pdf.PDFParser } ]
        }
      }
      { generateUUID { field : id } }
      { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
      { loadSolr { solrLocator : ${solrLocator} } }
    ]
  }
]

5. Start SOLR. I used the following command so I could watch the logging. Note that I am using the embedded ZooKeeper that starts with this command; whichever ZooKeeper you use, make sure the zkHost in morphline.conf matches (the embedded ZooKeeper typically listens on port 9983, the commented-out line above, while the Sandbox's standalone ZooKeeper listens on 2181):

./solr start -f
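
Once Solr is up, a quick sanity check that the collection1 core is loaded (assuming Solr on its default port, 8983):

curl "http://127.0.0.1:8983/solr/admin/cores?action=STATUS&wt=json"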

6. Start Flume. I used the following command:

/usr/hdp/current/flume-server/bin/flume-ng agent --name agent1 --conf /etc/flume/conf/agent1 --conf-file /home/flume/flumeSolrSink.conf -Dflume.root.logger=DEBUG,console

7. Drop a PDF file into /home/flume/dropzone. If you're watching the log, you'll see when the process is completed.
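
For example (the file name is just a placeholder), the spooling directory source renames each file with a .COMPLETED suffix once it has been fully ingested, which is an easy way to confirm pickup:

cp some-document.pdf /home/flume/dropzone/
ls /home/flume/dropzone/     # shows some-document.pdf.COMPLETED once ingested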

8. In the SOLR Admin UI, some queries to run:

  • text:* (or any text in the file)
  • title:* (or the title)
  • content_type:* (or pdf)
  • author:* (or the author)
  • use the content field for highlighting, not for searching
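
The same queries also work outside the admin UI, e.g. with curl (again assuming the default port, 8983):

curl "http://127.0.0.1:8983/solr/collection1/select?q=author:*&fl=id,title,author,content_type&wt=json&indent=true"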