Created on 05-17-2016 10:59 PM
The flow is as follows:
Spooling Directory Source > File Channel > MorphlineSolrSink
I wanted to complete this exercise to provide a less complex solution: fewer moving parts, less configuration, and no coding compared to a Kafka/Storm or Spark pipeline. The example is also easy to set up and demonstrate quickly.
Compared to Kafka/Storm, Flume is limited by its declarative nature, but that is also what makes it easy to use. However, Morphlines does provide a java command (with some potential performance side effects), so you can get fairly explicit when you need to.
I've read that Flume can handle around 50,000 events per second on a single server, so while the pipe may not be as fat as a Kafka/Storm pipe, it is well suited to many use cases.
1. Take care of dependencies. I am running HDP 2.2.4 Sandbox and the Solr that came with it. To get started, you will need to add a lot of dependencies to your /usr/hdp/current/flume-server/lib/. You can get all of the dependencies from /opt/solr/solr/contrib/ and /opt/solr/solr/dist directory structure.
commons-fileupload-1.2.1.jar
config-1.0.2.jar
fontbox-1.8.4.jar
httpmime-4.3.1.jar
kite-morphlines-avro-0.12.1.jar
kite-morphlines-core-0.12.1.jar
kite-morphlines-json-0.12.1.jar
kite-morphlines-tika-core-0.12.1.jar
kite-morphlines-tika-decompress-0.12.1.jar
kite-morphlines-twitter-0.12.1.jar
lucene-analyzers-common-4.10.4.jar
lucene-analyzers-kuromoji-4.10.4.jar
lucene-analyzers-phonetic-4.10.4.jar
lucene-core-4.10.4.jar
lucene-queries-4.10.4.jar
lucene-spatial-4.10.4.jar
metrics-core-3.0.1.jar
metrics-healthchecks-3.0.1.jar
noggit-0.5.jar
org.restlet-2.1.1.jar
org.restlet.ext.servlet-2.1.1.jar
pdfbox-1.8.4.jar
solr-analysis-extras-4.10.4.jar
solr-cell-4.10.4.jar
solr-clustering-4.10.4.jar
solr-core-4.10.4.jar
solr-dataimporthandler-4.10.4.jar
solr-dataimporthandler-extras-4.10.4.jar
solr-langid-4.10.4.jar
solr-map-reduce-4.10.4.jar
solr-morphlines-cell-4.10.4.jar
solr-morphlines-core-4.10.4.jar
solr-solrj-4.10.4.jar
solr-test-framework-4.10.4.jar
solr-uima-4.10.4.jar
solr-velocity-4.10.4.jar
spatial4j-0.4.1.jar
tika-core-1.5.jar
tika-parsers-1.5.jar
tika-xmp-1.5.jar
2. Configure SOLR. Next there are some important SOLR configurations:
3. Flume Configuration
#agent config
agent1.sources = spooling_dir_src
agent1.sinks = solr_sink
agent1.channels = fileChannel
# Use a file channel
agent1.channels.fileChannel.type = file
#agent1.channels.fileChannel.capacity = 10000
#agent1.channels.fileChannel.transactionCapacity = 10000
# Configure source
agent1.sources.spooling_dir_src.channels = fileChannel
agent1.sources.spooling_dir_src.type = spooldir
agent1.sources.spooling_dir_src.spoolDir = /home/flume/dropzone
agent1.sources.spooling_dir_src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
#Configure Solr Sink
agent1.sinks.solr_sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr_sink.morphlineFile = /home/flume/morphline.conf
agent1.sinks.solr_sink.batchSize = 1000
agent1.sinks.solr_sink.batchDurationMillis = 2500
agent1.sinks.solr_sink.channel = fileChannel
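One thing worth noting: the spooling directory source will fail on startup if the configured spoolDir does not exist, so create it (and make sure the user running the agent can write to it) before starting Flume. A minimal sketch, assuming the path from the config above:

```shell
# Create the drop zone that the spooldir source reads from (matches spoolDir above).
# DROPZONE is overridable so the same snippet works for other layouts.
DROPZONE=${DROPZONE:-/home/flume/dropzone}
mkdir -p "$DROPZONE"
chmod 755 "$DROPZONE"
```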
4. Morphline Configuration File
solrLocator: {
collection : collection1
#zkHost : "127.0.0.1:9983"
zkHost : "127.0.0.1:2181"
}
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
commands : [
{ detectMimeType { includeDefaultMimeTypes : true } }
{
solrCell {
solrLocator : ${solrLocator}
captureAttr : true
lowernames : true
capture : [title, author, content, content_type]
parsers : [ { parser : org.apache.tika.parser.pdf.PDFParser } ]
}
}
{ generateUUID { field : id } }
{ sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }
{ loadSolr: { solrLocator : ${solrLocator} } }
]
}
]
5. Start SOLR. I used the following command so I could watch the logging. Note I am using the embedded Zookeeper that starts with this command:
./solr start -f
6. Start Flume. I used the following command:
/usr/hdp/current/flume-server/bin/flume-ng agent --name agent1 --conf /etc/flume/conf/agent1 --conf-file /home/flume/flumeSolrSink.conf -Dflume.root.logger=DEBUG,console
7. Drop a PDF file into /home/flume/dropzone. If you're watching the log, you'll see when the process is completed.
8. In SOLR Admin, run some queries to verify the document was indexed.
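A few starter queries, assuming the fields captured by the solrCell command above (title, author, content, content_type); these field names come from this example's morphline, not from any universal schema:

```
q=*:*                              all indexed documents
q=content:flume                    full-text search over the extracted PDF body
q=content_type:"application/pdf"   restrict to documents parsed as PDFs
```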