Flume Ingestion to Solr with smaller files is very slow

Hi,

I have a large set of small files; each file is around 7-10 KB in size.

In total I have 350K files, about 6 GB.

I have tried many options in my Flume configuration, but whatever I change, Solr takes 2 seconds to ingest each file.

agent.sources = SpoolDirSrc

agent.channels = FileChannel

agent.sinks = SolrSink

# Configure Source

agent.sources.SpoolDirSrc.channels = FileChannel

agent.sources.SpoolDirSrc.type = spooldir

agent.sources.SpoolDirSrc.spoolDir = /app/home/solr/final

agent.sources.SpoolDirSrc.basenameHeader = true

#agent.sources.SpoolDirSrc.batchSize = 100000

agent.sources.SpoolDirSrc.fileHeader = true

agent.sources.SpoolDirSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

# Use a file channel that buffers events on disk

agent.channels.FileChannel.type = file

agent.channels.FileChannel.capacity = 1000

agent.channels.FileChannel.transactionCapacity = 1000

#agent.channels.FileChannel.transactionCapacity = 10000

# Configure Solr Sink

agent.sinks.SolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink

agent.sinks.SolrSink.morphlineFile = /etc/flume/conf/morphline.conf

#agent.sinks.SolrSink.batchsize = 100000

#agent.sinks.SolrSink.batchDurationMillis = 5000

agent.sinks.SolrSink.channel = FileChannel

agent.sinks.SolrSink.morphlineId = morphline1

agent.sinks.SolrSink.tika.config = tikaConfig.xml

agent.sinks.SolrSink.rollCount = 0

agent.sinks.SolrSink.rollInterval = 0

agent.sinks.SolrSink.rollsize = 100000000

agent.sinks.SolrSink.idleTimeout = 0

agent.sinks.SolrSink.batchSize = 100000

agent.sinks.SolrSink.txnEventMax = 10000000

agent.sources.SpoolDirSrc.channels = FileChannel

agent.sinks.SolrSink.channel = FileChannel

My collection has 2 shards and a replication factor of 1.

I do not have any issues ingesting data into HDFS; the problem is only with Solr.

Kindly let me know how I can improve this.

Regards,

~Sri

Re: Flume Ingestion to Solr with smaller files is very slow

I also do not have any issue when I ingest a large file.

Re: Flume Ingestion to Solr with smaller files is very slow

@Srinatha Anantharaman

Are your Solr indexes stored on HDFS? HDFS has more overhead for Solr than local storage, so smaller files are more susceptible to the performance hit. Are you looking for near real-time indexing? If so, using local storage is always a better approach.

You might want to adjust your autoCommit settings to do slightly more "bulk" indexing: https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
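
On the Flume side, the same "bulk" idea would mean moving events in larger batches rather than one small file at a time. Below is a minimal sketch, assuming the agent and component names from the question; the numbers are illustrative guesses, not tested recommendations. `batchSize` and `batchDurationMillis` are documented MorphlineSolrSink settings, and a sink's batch size should not exceed the channel's `transactionCapacity`. As far as I know, `rollCount`, `rollInterval`, `rollsize`, `idleTimeout` and `txnEventMax` belong to the HDFS sink and are not read by MorphlineSolrSink, so they are left out here.

```properties
# Sketch only: all values below are assumptions for illustration.

# Spooling directory source: number of events handed to the channel per transaction.
agent.sources.SpoolDirSrc.batchSize = 1000

# File channel: capacity should leave room for several batches in flight.
agent.channels.FileChannel.type = file
agent.channels.FileChannel.capacity = 100000
agent.channels.FileChannel.transactionCapacity = 1000

# Solr sink: events taken per transaction, kept at or below transactionCapacity.
agent.sinks.SolrSink.batchSize = 1000
# Maximum time the sink spends on one batch before finishing the transaction.
agent.sinks.SolrSink.batchDurationMillis = 5000
```

Whether this helps depends on where the 2 seconds per file are actually spent. If the time goes into the morphline itself or into Solr commits, larger Flume batches alone may not change much.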

Re: Flume Ingestion to Solr with smaller files is very slow

Thank you Michael for your reply.

I am not using HDFS for storing my Solr Indexes.

While indexing into Solr, I am also archiving those files to HDFS in parallel. I only meant that writing to HDFS has no issues.

I have also tried committing frequently, and committing only at the end of the process, but either way I see very little data indexed into Solr.