How to Index PDF Files with Flume and MorphlineSolrSink
Created on 05-17-2016 10:59 PM
The flow is as follows:
Spooling Directory Source > File Channel > MorphlineSolrSink
The reason I wanted to complete this exercise was to provide a less complex solution: fewer moving parts, less configuration, and no coding compared to Kafka/Storm or Spark. The example is also easy to set up and demonstrate quickly.
Flume, compared to Kafka/Storm, is limited by its declarative nature, but that is also what makes it easy to use. However, the morphline does provide a java command (with some potential performance side effects), so you can get pretty explicit when you need to.
I've read that Flume can handle roughly 50,000 events per second on a single server, so while the pipe may not be as fat as a Kafka/Storm pipe, it is well suited to many use cases.
Step-by-step guide
1. Take care of dependencies. I am running the HDP 2.2.4 Sandbox and the Solr that came with it. To get started, you will need to add a number of dependencies to /usr/hdp/current/flume-server/lib/. You can find all of them under the /opt/solr/solr/contrib/ and /opt/solr/solr/dist/ directory structures (a copy sketch follows the list below):
commons-fileupload-1.2.1.jar
config-1.0.2.jar
fontbox-1.8.4.jar
httpmime-4.3.1.jar
kite-morphlines-avro-0.12.1.jar
kite-morphlines-core-0.12.1.jar
kite-morphlines-json-0.12.1.jar
kite-morphlines-tika-core-0.12.1.jar
kite-morphlines-tika-decompress-0.12.1.jar
kite-morphlines-twitter-0.12.1.jar
lucene-analyzers-common-4.10.4.jar
lucene-analyzers-kuromoji-4.10.4.jar
lucene-analyzers-phonetic-4.10.4.jar
lucene-core-4.10.4.jar
lucene-queries-4.10.4.jar
lucene-spatial-4.10.4.jar
metrics-core-3.0.1.jar
metrics-healthchecks-3.0.1.jar
noggit-0.5.jar
org.restlet-2.1.1.jar
org.restlet.ext.servlet-2.1.1.jar
pdfbox-1.8.4.jar
solr-analysis-extras-4.10.4.jar
solr-cell-4.10.4.jar
solr-clustering-4.10.4.jar
solr-core-4.10.4.jar
solr-dataimporthandler-4.10.4.jar
solr-dataimporthandler-extras-4.10.4.jar
solr-langid-4.10.4.jar
solr-map-reduce-4.10.4.jar
solr-morphlines-cell-4.10.4.jar
solr-morphlines-core-4.10.4.jar
solr-solrj-4.10.4.jar
solr-test-framework-4.10.4.jar
solr-uima-4.10.4.jar
solr-velocity-4.10.4.jar
spatial4j-0.4.1.jar
tika-core-1.5.jar
tika-parsers-1.5.jar
tika-xmp-1.5.jar
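Rather than hunting down each jar by hand, a small shell loop can do the copying. This is a minimal sketch: jars.txt is a hypothetical helper file containing the jar names above, one per line, and the paths assume the Sandbox layout from this step.
# jars.txt is a hypothetical list of the jar names above, one per line
while read jar; do
  find /opt/solr/solr/contrib /opt/solr/solr/dist -name "$jar" \
    -exec cp -v {} /usr/hdp/current/flume-server/lib/ \;
done < jars.txt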
2. Configure SOLR. There are a few important SOLR configuration files:
- solr.xml – The solr.xml included with collection1 can be used unmodified.
- schema.xml – The schema.xml included with collection1 is all you need. It already defines the fields that SolrCell returns when processing a PDF file; just make sure you capture the fields you want with the solrCell command in morphline.conf.
- solrconfig.xml – The solrconfig.xml included with collection1 is all you need. It includes the ExtractingRequestHandler required to process PDF files.
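To double-check that the handler is present, you can grep for it. The path below is an assumption based on the default Solr example layout; adjust it to wherever your collection1 configuration actually lives:
grep -n "ExtractingRequestHandler" /opt/solr/solr/collection1/conf/solrconfig.xml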
3. Flume Configuration. Save the following as /home/flume/flumeSolrSink.conf (the file referenced when starting the agent in step 6):
#agent config
agent1.sources = spooling_dir_src
agent1.sinks = solr_sink
agent1.channels = fileChannel
# Use a file channel
agent1.channels.fileChannel.type = file
#agent1.channels.fileChannel.capacity = 10000
#agent1.channels.fileChannel.transactionCapacity = 10000
# Configure source
agent1.sources.spooling_dir_src.channels = fileChannel
agent1.sources.spooling_dir_src.type = spooldir
agent1.sources.spooling_dir_src.spoolDir = /home/flume/dropzone
agent1.sources.spooling_dir_src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
#Configure Solr Sink
agent1.sinks.solr_sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr_sink.morphlineFile = /home/flume/morphline.conf
agent1.sinks.solr_sink.batchSize = 1000
agent1.sinks.solr_sink.batchDurationMillis = 2500
agent1.sinks.solr_sink.channel = fileChannel
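Before starting the agent, create the spooling directory; the spooldir source does not create it for you and will fail to start if it is missing. A minimal sketch, assuming the agent runs as the flume user:
mkdir -p /home/flume/dropzone
chown -R flume:flume /home/flume/dropzone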
4. Morphline Configuration File. Save the following as /home/flume/morphline.conf, the file referenced by the sink configuration above:
solrLocator : {
  collection : collection1
  # zkHost : "127.0.0.1:9983"
  zkHost : "127.0.0.1:2181"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { detectMimeType { includeDefaultMimeTypes : true } }
      {
        solrCell {
          solrLocator : ${solrLocator}
          captureAttr : true
          lowernames : true
          capture : [title, author, content, content_type]
          parsers : [ { parser : org.apache.tika.parser.pdf.PDFParser } ]
        }
      }
      { generateUUID { field : id } }
      { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
      { loadSolr : { solrLocator : ${solrLocator} } }
    ]
  }
]
5. Start SOLR. I used the following command so I could watch the logging. Make sure the zkHost in morphline.conf points at the ZooKeeper that Solr is actually registered with: the Sandbox's ZooKeeper on port 2181 in my case, or port 9983 (the commented-out value) if you run Solr with its embedded ZooKeeper:
./solr start -f
6. Start Flume. I used the following command:
/usr/hdp/current/flume-server/bin/flume-ng agent --name agent1 --conf /etc/flume/conf/agent1 --conf-file /home/flume/flumeSolrSink.conf -Dflume.root.logger=DEBUG,console
7. Drop a PDF file into /home/flume/dropzone. If you are watching the log, you will see when processing completes; by default the spooling directory source renames each ingested file with a .COMPLETED suffix.
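For example, using a placeholder file name:
cp example.pdf /home/flume/dropzone/
ls /home/flume/dropzone   # example.pdf.COMPLETED appears once Flume has ingested it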
8. In the SOLR Admin UI, some queries to run:
- text:* (or any text in the file)
- title:* (or the title)
- content_type:* (or pdf)
- author:* (or the author)
- Use the content field for highlighting, not for searching.
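You can also run the same queries from the command line. A minimal sketch, assuming Solr is listening on the default port 8983:
curl "http://127.0.0.1:8983/solr/collection1/select?q=author:*&wt=json&indent=true"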