I have a large set of small files; each file is around 7 – 10 KB in size.
In total I have 350K files, around 6 GB.
I have tried many variations of my Flume configuration, but whatever I change, Solr takes about 2 seconds to ingest each file.
agent.sources = SpoolDirSrc
agent.channels = FileChannel
agent.sinks = SolrSink
# Configure Source
agent.sources.SpoolDirSrc.channels = fileChannel
agent.sources.SpoolDirSrc.type = spooldir
agent.sources.SpoolDirSrc.spoolDir = /app/home/solr/final
agent.sources.SpoolDirSrc.basenameHeader = true
#agent.sources.SpoolDirSrc.batchSize = 100000
agent.sources.SpoolDirSrc.fileHeader = true
agent.sources.SpoolDirSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Use a file channel that buffers events on disk
agent.channels.FileChannel.type = file
agent.channels.FileChannel.capacity = 1000
agent.channels.FileChannel.transactionCapacity = 1000
#agent.channels.FileChannel.transactionCapacity = 10000
# Configure Solr Sink
agent.sinks.SolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.SolrSink.morphlineFile = /etc/flume/conf/morphline.conf
#agent.sinks.SolrSink.batchsize = 100000
#agent.sinks.SolrSink.batchDurationMillis = 5000
agent.sinks.SolrSink.channel = fileChannel
agent.sinks.SolrSink.morphlineId = morphline1
agent.sinks.SolrSink.tika.config = tikaConfig.xml
agent.sinks.SolrSink.rollCount = 0
agent.sinks.SolrSink.rollInterval = 0
agent.sinks.SolrSink.rollsize = 100000000
agent.sinks.SolrSink.idleTimeout = 0
agent.sinks.SolrSink.batchSize = 100000
agent.sinks.SolrSink.txnEventMax = 10000000
agent.sources.SpoolDirSrc.channels = FileChannel
agent.sinks.SolrSink.channel = FileChannel
My collection has 2 shards, with replication set to 1.
I do not have any issues ingesting data into HDFS; the issues are only with Solr.
Kindly let me know how I can improve this.
Are your Solr indexes stored on HDFS? HDFS has more overhead for Solr than local storage does, so small files are more susceptible to the performance hit. Are you looking for near real-time indexing? If so, local storage is always the better approach.
You might also want to adjust your autoCommit settings so that indexing happens in a more "bulk" fashion: https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
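As a rough sketch of what more bulk-oriented commit settings could look like, something along these lines would go inside the updateHandler section of solrconfig.xml. The time values here are assumptions for illustration only, not tuned recommendations for your collection:

```xml
<!-- Hard commit: flushes to stable storage on a timer.
     60000 ms is an illustrative value, not a recommendation. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <!-- false: do not open a new searcher on every hard commit,
       which keeps hard commits cheap during a bulk load -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: controls when newly indexed documents become searchable.
     300000 ms (5 minutes) is an assumed value for a bulk load. -->
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>
```

The idea is to let Solr commit on its own schedule rather than committing per file, so visibility of new documents is delayed a bit in exchange for less commit overhead. If you need documents searchable sooner, a shorter autoSoftCommit interval is the setting to experiment with.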
Thank you, Michael, for your reply.
I am not using HDFS to store my Solr indexes.
While indexing into Solr, I am also archiving the same files to HDFS in parallel; I only mentioned HDFS to say that writing there has no issues.
I have also tried committing frequently, and also committing only at the end of the process, but in both cases not much data gets indexed into Solr.