Bulk Inserts into SOLR

Are there any examples of doing bulk inserts into SOLR?

Our Storm ingestion pipeline is processing around 1 million messages per second, and we want to send bulk requests to Solr.

1 ACCEPTED SOLUTION

Mike, are you asking about the commit batch size on the client side? This is controllable in the API, but it may have adverse effects depending on how powerful your SolrCloud cluster is. It is the classic trade-off between throughput and latency, as old as the world.

However, if you can tolerate somewhat higher latency, consider dumping the stream to HDFS or streaming it into Hive, and then using, for example, the MR2, Pig, or Hive connectors that Solr provides: https://doc.lucidworks.com/hdpsearch23/Guide-Jobs.html This will allow for maximum parallelism and throughput.
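
For what it's worth, here is a rough SolrJ sketch of the client-side batching idea. The batch size, commitWithin value, collection name and ZooKeeper address are just placeholders, and the CloudSolrClient constructor shown is the SolrJ 5.x style (newer versions use a builder), so treat it as an illustration rather than the definitive way to do it.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // SolrJ 5.x constructor; newer versions use CloudSolrClient.Builder
        CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        solr.setDefaultCollection("my_collection");

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text_s", "message " + i);
            batch.add(doc);

            // one bulk request per 10,000 docs; let Solr commit within 30 seconds
            if (batch.size() == 10000) {
                solr.add(batch, 30000);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch, 30000);
        }
        solr.close();
    }
}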

4 REPLIES

You could use the official SolrBolt from Lucidworks (https://github.com/LucidWorks/storm-solr) and put your messages into Solr using batch sizes of 1000 or even 10000.
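
Conceptually, the batching such a bolt does boils down to something like the sketch below. This is not the storm-solr API, just an illustration with plain SolrJ inside a Storm bolt; the tuple field names, batch size, collection and ZooKeeper address are made up, and it assumes Storm 1.x package names and the SolrJ 5.x client constructor.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class SolrBatchBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 1000;

    private transient SolrClient solr;
    private transient List<SolrInputDocument> buffer;
    private transient List<Tuple> pending;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // the Solr client is not serializable, so it is created here on the worker
        CloudSolrClient cloud = new CloudSolrClient("zk1:2181/solr");   // SolrJ 5.x constructor
        cloud.setDefaultCollection("my_collection");
        this.solr = cloud;
        this.buffer = new ArrayList<>();
        this.pending = new ArrayList<>();
    }

    @Override
    public void execute(Tuple tuple) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", tuple.getStringByField("id"));        // made-up tuple fields
        doc.addField("body_txt", tuple.getStringByField("body"));
        buffer.add(doc);
        pending.add(tuple);

        if (buffer.size() >= BATCH_SIZE) {
            try {
                solr.add(buffer, 30000);                          // one bulk request, commitWithin 30s
                for (Tuple t : pending) collector.ack(t);
            } catch (Exception e) {
                for (Tuple t : pending) collector.fail(t);
            } finally {
                buffer.clear();
                pending.clear();
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt; nothing is emitted downstream
    }
}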

As Andrew pointed out, the second option is to write your messages to HDFS and afterwards use the Job Jar to load the data into Solr. The command looks something like this:

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar \
  com.lucidworks.hadoop.ingest.IngestJob \
  -Dlww.commit.on.close=true \
  -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
  --collection my_collection \
  -i /data/* \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  --solrServer http://c6601.ambari.apache.org:8983/solr

Super Collaborator

@Jonas Straub

The storm-solr codebase has a SolrBoltAction that should be used if you are converting your Storm tuples to SolrInputDocuments and indexing them. SolrBoltAction supports micro-batching these docs and then indexing them via the SolrBoltAction.sendBatchToSolr method.

The codebase also has a SolrJsonBoltAction that extends SolrBoltAction and should be used if you are indexing JSON docs. I have noticed that SolrJsonBoltAction does not support batch indexing of JSON docs via the ContentStreamUpdateRequest, whereas SolrBoltAction does.

We need to understand how to do batch indexing of JSON docs in one Solr call. I did notice that ContentStreamUpdateRequest.addContentStream just adds content streams to a collection, but when I put multiple JSON docs into that collection and then execute the request, Solr only indexes one item from the collection.

Master Guru

From looking at SolrJsonBoltAction, if the incoming object is an array of JSON documents and you call bolt.setSplit("/"), it should split the array into multiple documents on the Solr side. I have never used the storm-solr project, but I'm saying this based on having used the JSON ContentStreamUpdateRequest in the past.

I'm not sure whether this capability requires a specific version of Solr, but using Solr >= 5.0 I had previously put together some test cases for this:

https://github.com/bbende/solrj-custom-json-update/blob/master/src/test/java/org/apache/solr/IndexJS...
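
For reference, a rough SolrJ sketch of that approach (a single request carrying many JSON documents, split server-side) is below. It assumes Solr >= 5 with the /update/json/docs handler; the collection name, file path and commitWithin value are placeholders, and the HttpSolrClient constructor shown is the SolrJ 5.x/6.x style.

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class JsonBatchUpdate {
    public static void main(String[] args) throws Exception {
        // SolrJ 5.x/6.x constructor; newer versions use HttpSolrClient.Builder
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/my_collection");

        // docs.json holds a top-level JSON array containing many documents
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/json/docs");
        req.addFile(new File("docs.json"), "application/json");
        req.setParam("split", "/");          // split the array into individual documents
        req.setCommitWithin(30000);

        solr.request(req);                   // one HTTP call indexes the whole batch
        solr.close();
    }
}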