Support Questions

sdutta · ‎01-13-2016

We have a customer that needs to update a few billion documents in SolrCloud. I know the suggested approach is to use CloudSolrClient, for its load-balancing feature.

As per docs - CloudSolrClient

SolrJ client class to communicate with SolrCloud. Instances of this class communicate with Zookeeper to discover Solr endpoints for SolrCloud collections, and then use the LBHttpSolrClient to issue requests. This class assumes the id field for your documents is called 'id' - if this is not the case, you must set the right name with setIdField(String).

As per the docs - ConcurrentUpdateSolrClient

ConcurrentUpdateSolrClient buffers all added documents and writes them into open HTTP connections. This class is thread safe. Params from UpdateRequest are converted to http request parameters. When params change between UpdateRequests a new HTTP request is started. Although any SolrClient request can be made with this implementation, it is only recommended to use ConcurrentUpdateSolrClient with /update requests. The class HttpSolrClient is better suited for the query interface.

With ConcurrentUpdateSolrClient I can use a queue and a pool of threads, which makes it look more attractive than CloudSolrClient, which uses an HttpSolrClient once it has a set of nodes to send the updates to.

I would love to hear more in depth discussion on these 2 APIs.

Thanks

Shivaji

bdurai · ‎01-13-2016

@sdutta in SolrCloud you should be using the CloudSolrClient class. It should take care of everything you mentioned. It gets the active Solr servers from Zookeeper, and when you add a document, it automatically sends it to the server hosting the shard for that id. It also keeps track of whether any Solr server is out of commission and automatically reconfigures itself.

CloudSolrClient solrCloudClient = new CloudSolrClient(zkHosts);

solrCloudClient.setDefaultCollection(collectionName);

sdutta · ‎01-13-2016

Bosco, CloudSolrClient uses an LBHttpSolrClient internally (which load balances across the nodes), but I do not see that LBHttpSolrClient is multithreaded. So the question is: which has the higher throughput?

bdurai · ‎01-13-2016

You will first have to see where the bottleneck is. Regardless of how much you push to the Solr server, it can only index so many documents at a time. If you feel transport is the main issue, you can create a couple of threads, each with its own solrClient instance.

Secondly, you need to batch all your requests, and you shouldn't commit from the client side. Configure auto-commit on the Solr server side and let it do the final commit. Between Solr doing the buffering and you doing the batching, I am not sure what the difference would be.
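To make the batching and threading advice concrete, here is a minimal sketch in plain Java. It is not SolrJ code: BatchSink is a hypothetical interface standing in for whatever client you use (with SolrJ, the sink body would typically call client.add(batch) with a collection of SolrInputDocument), and BatchingIndexer, the batch size and the thread count are illustrative names and values, not recommendations. Note that nothing here issues a commit; that is left to auto-commit on the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a Solr client; a real sink would forward the batch to SolrJ. */
interface BatchSink<T> {
    void send(List<T> batch) throws Exception;
}

final class BatchingIndexer {
    /** Splits docs into consecutive batches of at most batchSize, preserving order. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /**
     * Sends every batch through the sink on a fixed-size thread pool and
     * returns the number of documents sent. No commit is issued from here.
     */
    static <T> int indexAll(List<T> docs, int batchSize, int threads, BatchSink<T> sink)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> {
                    sink.send(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> f : pending) {
                sent += f.get(); // rethrows any failure from a worker thread
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether the sink wraps one shared client or one client per thread is left to the caller; the sketch only shows the shape of "batch first, then parallelize the sends".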

sdutta · ‎01-13-2016

@bdurai

I posted to the Solr community and got the answer below from a committer:

It's usually not all that difficult to write a multi-threaded client that uses CloudSolrClient, or even fire up multiple instances of the SolrJ client (assuming they can work on discrete sections of the documents you need to index).

That avoids the problem Shawn alludes to. Plus other issues. If you do _not_ use CloudSolrClient, then all the docs go to some node in the system that then sub-divides the list (and you really should update in batches, see: https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/) then the node that receives the packet sub-divides it into groups based on what shard they should be part of and forwards them to the leaders for that shard, very significantly increasing the numbers of conversations being carried on between Solr nodes. Times the number of threads you're specifying with CUSC (I really regret the renaming from ConcurrentUpdateSolrServer, I liked writing CUSS).

With CloudSolrClient, you can scale nearly linearly with the number of shards. Not so with CUSC.

FWIW,

Erick
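The point about sub-dividing a batch by shard can be illustrated with a small sketch of the grouping step. This is plain Java, not Solr's routing code: ShardGrouping and shardFor are hypothetical names, and the modulo-of-hashCode rule is only an assumption made for the example (Solr's own router assigns documents to shards by hash ranges over the id). The idea is that when the client does this grouping itself, each group can go straight to the leader of its shard instead of being re-split and forwarded by whichever node happened to receive the batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShardGrouping {
    /**
     * Assumption for illustration only: picks a shard by hashCode modulo the
     * shard count. Solr's real router uses a different hash and hash ranges.
     */
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    /** Groups document ids by target shard, keeping input order within each group. */
    static Map<Integer, List<String>> groupByShard(List<String> ids, int numShards) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String id : ids) {
            groups.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return groups;
    }
}
```

With one group per shard, the number of client-to-node conversations per batch is bounded by the number of shards, which fits the remark above about scaling with shard count.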

christopher_w_m · ‎01-13-2016

Throwing my 2 cents in since I've spent an insane amount of time working with Solr on this exact problem.

ConcurrentUpdateSolrClient is really easy to get going and you can get a high throughput just by increasing the number of threads. However, at some point it just won't be scalable or efficient once you have a bunch of Solr nodes.

If you are using SolrCloud, then CloudSolrClient is definitely the recommended way to go but, in my experience, it is much, much harder to get high throughput with it. Batching documents is pretty much a requirement. You can't really just increase the number of threads because each one opens a connection to Zookeeper.

If you decide to go with CloudSolrClient, take a look at the code in storm-solr.
