Flume Ingestion to Solr with smaller files is very slow

Hi,

I have a large set of small files, each around 7 to 10 KB in size.

In total I have about 350K files, around 6 GB.

I have tried many Flume configuration options, but whatever I change, Solr still takes 2 seconds to ingest each file.

agent.sources = SpoolDirSrc

agent.channels = FileChannel

agent.sinks = SolrSink

# Configure Source

agent.sources.SpoolDirSrc.channels = FileChannel

agent.sources.SpoolDirSrc.type = spooldir

agent.sources.SpoolDirSrc.spoolDir = /app/home/solr/final

agent.sources.SpoolDirSrc.basenameHeader = true

#agent.sources.SpoolDirSrc.batchSize = 100000

agent.sources.SpoolDirSrc.fileHeader = true

agent.sources.SpoolDirSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

# Configure Channel (file channel: buffers events on disk)

agent.channels.FileChannel.type = file

agent.channels.FileChannel.capacity = 1000

agent.channels.FileChannel.transactionCapacity = 1000

#agent.channels.FileChannel.transactionCapacity = 10000

# Configure Solr Sink

agent.sinks.SolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink

agent.sinks.SolrSink.morphlineFile = /etc/flume/conf/morphline.conf

#agent.sinks.SolrSink.batchsize = 100000

#agent.sinks.SolrSink.batchDurationMillis = 5000

agent.sinks.SolrSink.channel = FileChannel

agent.sinks.SolrSink.morphlineId = morphline1

agent.sinks.SolrSink.tika.config = tikaConfig.xml

agent.sinks.SolrSink.rollCount = 0

agent.sinks.SolrSink.rollInterval = 0

agent.sinks.SolrSink.rollsize = 100000000

agent.sinks.SolrSink.idleTimeout = 0

agent.sinks.SolrSink.batchSize = 100000

agent.sinks.SolrSink.txnEventMax = 10000000

agent.sources.SpoolDirSrc.channels = FileChannel

agent.sinks.SolrSink.channel = FileChannel

My collection has 2 shards and a replication factor of 1.

I have no issues ingesting data into HDFS; the problem occurs only with Solr.

Kindly let me know how I can improve this.

Regards,

~Sri

3 Replies

I also do not have any issue when I ingest a large file.

@Srinatha Anantharaman

Are your Solr indexes stored on HDFS? HDFS has more overhead for Solr than local storage, so smaller files are more susceptible to the performance hit. Are you looking for near real-time indexing? If so, using local storage is always a better approach.

You might want to adjust your autoCommit settings so that commits happen less often and indexing is done in a slightly more "bulk" fashion: https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
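
As a rough sketch of what that could look like, the fragment below sets a hard commit interval and a separate soft commit interval in solrconfig.xml. The numeric values are assumptions chosen only for illustration, not recommendations from this thread; tune them for your own latency and load.

```xml
<!-- Goes inside the <updateHandler> element of solrconfig.xml. -->
<!-- All values are illustrative assumptions. -->

<!-- Hard commit: flush to stable storage at most every 60 s or 10000 docs, -->
<!-- without opening a new searcher on each commit. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <maxDocs>10000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: make newly indexed documents searchable every 15 s. -->
<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>
```

With openSearcher set to false, hard commits do not make new documents visible; visibility is then governed by the soft commit interval, so many small files can be indexed between commits instead of paying the commit cost per file.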

Thank you, Michael, for your reply.

I am not using HDFS to store my Solr indexes.

While indexing into Solr, I am also archiving those files to HDFS in parallel; I only meant that writing to HDFS has no issues.

I have also tried committing frequently, as well as committing only at the end of the process, but either way I see very little data indexed into Solr.
