The reason I
wanted to complete this exercise was to provide a less complex solution; that
is, fewer moving parts, less configuration, and no coding compared to Kafka/Storm
or Spark. Also, the example is easy to set up and demonstrate quickly.
Flume,
compared to Kafka/Storm, is limited by its declarative nature, but that is what
makes it easy to use. However, the morphline does provide a java command
(with some potential performance side effects), so you can get pretty explicit
when you need to.
I’ve read that
Flume can handle around 50,000 events per second on a single server, so while the
pipe may not be as fat as a Kafka/Storm pipe, it may be well suited for many
use cases.
Step-by-step guide
1. Take care of
dependencies. I am running the HDP 2.2.4 Sandbox and the Solr that came with
it. To get started, you will need to add a lot of dependencies to
your /usr/hdp/current/flume-server/lib/ directory. You can get all of the
dependencies from the /opt/solr/solr/contrib/
and /opt/solr/solr/dist directory structures:
commons-fileupload-1.2.1.jar
config-1.0.2.jar
fontbox-1.8.4.jar
httpmime-4.3.1.jar
kite-morphlines-avro-0.12.1.jar
kite-morphlines-core-0.12.1.jar
kite-morphlines-json-0.12.1.jar
kite-morphlines-tika-core-0.12.1.jar
kite-morphlines-tika-decompress-0.12.1.jar
kite-morphlines-twitter-0.12.1.jar
lucene-analyzers-common-4.10.4.jar
lucene-analyzers-kuromoji-4.10.4.jar
lucene-analyzers-phonetic-4.10.4.jar
lucene-core-4.10.4.jar
lucene-queries-4.10.4.jar
lucene-spatial-4.10.4.jar
metrics-core-3.0.1.jar
metrics-healthchecks-3.0.1.jar
noggit-0.5.jar
org.restlet-2.1.1.jar
org.restlet.ext.servlet-2.1.1.jar
pdfbox-1.8.4.jar
solr-analysis-extras-4.10.4.jar
solr-cell-4.10.4.jar
solr-clustering-4.10.4.jar
solr-core-4.10.4.jar
solr-dataimporthandler-4.10.4.jar
solr-dataimporthandler-extras-4.10.4.jar
solr-langid-4.10.4.jar
solr-map-reduce-4.10.4.jar
solr-morphlines-cell-4.10.4.jar
solr-morphlines-core-4.10.4.jar
solr-solrj-4.10.4.jar
solr-test-framework-4.10.4.jar
solr-uima-4.10.4.jar
solr-velocity-4.10.4.jar
spatial4j-0.4.1.jar
tika-core-1.5.jar
tika-parsers-1.5.jar
tika-xmp-1.5.jar
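Copying that many jars by hand is tedious, so here is a minimal Python sketch of one way to automate it. It assumes the two Solr directories and the Flume lib directory named above; the helper name copy_jars and its wanted parameter are my own and not part of Flume or Solr. Without a wanted set it copies every jar it finds, which may be more than the list above, so pass the file names you actually need.

```python
import shutil
from pathlib import Path

# Paths taken from the text above; adjust them if your install differs.
SOLR_DIRS = [Path("/opt/solr/solr/contrib"), Path("/opt/solr/solr/dist")]
FLUME_LIB = Path("/usr/hdp/current/flume-server/lib")


def copy_jars(source_dirs, dest_dir, wanted=None):
    """Copy *.jar files found under source_dirs into dest_dir.

    wanted: optional set of jar file names; when given, only those are copied.
    Jars already present in dest_dir are left untouched.
    Returns the sorted list of file names that were copied.
    """
    dest_dir = Path(dest_dir)
    copied = []
    for root in source_dirs:
        # Search recursively, since the jars sit in nested sub-directories.
        for jar in sorted(Path(root).rglob("*.jar")):
            if wanted is not None and jar.name not in wanted:
                continue
            target = dest_dir / jar.name
            if target.exists():
                continue  # do not overwrite a jar that is already there
            shutil.copy2(jar, target)
            copied.append(jar.name)
    return sorted(copied)
```

Run as a user with write access to the Flume lib directory, a call such as copy_jars(SOLR_DIRS, FLUME_LIB, wanted={"solr-cell-4.10.4.jar", "tika-core-1.5.jar"}) returns the names it copied, which you can compare against the list above.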
2. Configure
Solr. Next, there are some important Solr configuration files:
solr.xml – The solr.xml included with collection1 was left
unmodified.
schema.xml – The schema.xml that is included with
collection1 is all you need. It includes the fields that SolrCell will
return when processing the PDF file. You need to make sure that you
capture the fields you want with the SolrCell command in the
morphline.conf file.
solrconfig.xml – The solrconfig.xml that is included
with collection1 is all you need. It includes the
ExtractingRequestHandler that you need to process the PDF file.
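If you want to double-check the two files before starting Flume, a small Python sketch like the following can help. It only parses the XML: it lists the explicitly declared fields in schema.xml and looks for a request handler whose class mentions ExtractingRequestHandler in solrconfig.xml. The function names are my own, dynamic fields are not taken into account, and any field names you pass in are just examples of what you might capture in morphline.conf.

```python
import xml.etree.ElementTree as ET


def schema_field_names(schema_xml):
    """Return the set of <field name="..."> names declared in a schema.xml string."""
    root = ET.fromstring(schema_xml)
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def has_extracting_handler(solrconfig_xml):
    """True if any <requestHandler> class mentions ExtractingRequestHandler."""
    root = ET.fromstring(solrconfig_xml)
    return any(
        "ExtractingRequestHandler" in (h.get("class") or "")
        for h in root.iter("requestHandler")
    )


def missing_fields(schema_xml, wanted):
    """Fields from `wanted` that the schema does not declare explicitly."""
    return sorted(set(wanted) - schema_field_names(schema_xml))
```

Read the contents of the two files from your collection1 conf directory and pass them in as strings. An empty result from missing_fields for the fields you plan to capture, together with has_extracting_handler returning True, suggests the stock configuration is sufficient.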