<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102252#M65212</link>
    <description>&lt;P&gt;Bosco, CloudSolrClient uses an LBHttpSolrClient internally (which load balances across the nodes), but I do not see that LBHttpSolrClient is multithreaded. So that raises the question: which has the higher throughput?&lt;/P&gt;</description>
    <pubDate>Wed, 13 Jan 2016 10:47:24 GMT</pubDate>
    <dc:creator>sdutta</dc:creator>
    <dc:date>2016-01-13T10:47:24Z</dc:date>
    <item>
      <title>ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102250#M65210</link>
      <description>&lt;P&gt;We have a customer that needs to update a few billion documents in SolrCloud. I know the suggested approach is to use CloudSolrClient, for its load-balancing feature.&lt;/P&gt;&lt;P&gt;As per the docs - CloudSolrClient&lt;/P&gt;&lt;P&gt;SolrJ client class to communicate with SolrCloud. Instances of this class communicate with Zookeeper to discover Solr endpoints for SolrCloud collections, and then use the &lt;A href="http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/LBHttpSolrClient.html"&gt;&lt;CODE&gt;LBHttpSolrClient&lt;/CODE&gt;&lt;/A&gt; to issue requests. This class assumes the id field for your documents is called 'id' - if this is not the case, you must set the right name with &lt;A href="http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html#setIdField%28java.lang.String%29"&gt;&lt;CODE&gt;setIdField(String)&lt;/CODE&gt;&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;As per the docs - ConcurrentUpdateSolrClient&lt;/P&gt;&lt;P&gt;
ConcurrentUpdateSolrClient buffers all added documents and writes them into open HTTP connections. This class is thread safe. Params from &lt;A href="http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/request/UpdateRequest.html"&gt;&lt;CODE&gt;UpdateRequest&lt;/CODE&gt;&lt;/A&gt; are converted to http request parameters. When params change between UpdateRequests a new HTTP request is started. Although any SolrClient request can be made with this implementation, it is only recommended to use ConcurrentUpdateSolrClient with /update requests. The class &lt;A href="http://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrClient.html"&gt;&lt;CODE&gt;HttpSolrClient&lt;/CODE&gt;&lt;/A&gt; is better suited for the query interface.&lt;/P&gt;&lt;P&gt;Since ConcurrentUpdateSolrClient lets me use a queue and a pool of threads, it looks more attractive than CloudSolrClient, which will use an HttpSolrClient once it gets a set of nodes to do the updates.&lt;/P&gt;&lt;P&gt;I would love to hear a more in-depth discussion of these two APIs.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Shivaji&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jan 2016 08:32:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102250#M65210</guid>
      <dc:creator>sdutta</dc:creator>
      <dc:date>2016-01-13T08:32:03Z</dc:date>
    </item>
    <item>
      <title>Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102251#M65211</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/137/sdutta.html" nodeid="137"&gt;@sdutta&lt;/A&gt; in SolrCloud you should be using the CloudSolrClient class. It should take care of everything you mentioned. It gets the active Solr servers from ZooKeeper, and when you add a document, it automatically sends it to the server hosting the shard for that id. It also keeps track of whether any Solr server is out of commission and automatically reconfigures itself.&lt;/P&gt;&lt;P&gt;CloudSolrClient solrCloudClient = new CloudSolrClient(zkHosts);&lt;/P&gt;&lt;P&gt;solrCloudClient.setDefaultCollection(collectionName);&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jan 2016 09:27:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102251#M65211</guid>
      <dc:creator>bdurai</dc:creator>
      <dc:date>2016-01-13T09:27:07Z</dc:date>
    </item>
    <item>
      <title>Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102252#M65212</link>
      <description>&lt;P&gt;Bosco, CloudSolrClient uses an LBHttpSolrClient internally (which load balances across the nodes), but I do not see that LBHttpSolrClient is multithreaded. So that raises the question: which has the higher throughput?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jan 2016 10:47:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102252#M65212</guid>
      <dc:creator>sdutta</dc:creator>
      <dc:date>2016-01-13T10:47:24Z</dc:date>
    </item>
    <item>
      <title>Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102253#M65213</link>
      <description>&lt;P&gt;You will first have to see where the bottleneck is. Regardless of how much you push to the Solr server, it can index only so many documents at a time. If you feel transport is the main issue, then you can just create a couple of threads, and each thread can have its own solrClient instance.&lt;/P&gt;&lt;P&gt;Secondly, you need to batch all your requests, and you shouldn't commit from the client side. You should configure auto-commit on the Solr server side and let it do the final commit. Between Solr doing the buffering versus you doing the batching, I am not sure what the difference would be.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 01:32:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102253#M65213</guid>
      <dc:creator>bdurai</dc:creator>
      <dc:date>2016-01-14T01:32:37Z</dc:date>
    </item>
    <item>
      <title>Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102254#M65214</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/229/bdurai.html" nodeid="229"&gt;@bdurai&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I posted on the Solr community list and got the answer below from a committer:&lt;/P&gt;&lt;P&gt;It's usually not all that difficult to write a multi-threaded client that uses CloudSolrClient, or even fire up multiple instances of the SolrJ client (assuming they can work on discrete sections of the documents you need to index).&lt;/P&gt;&lt;P&gt;That avoids the problem Shawn alludes to. Plus other issues. If you do _not_ use CloudSolrClient, then all the docs go to some node in the system that then sub-divides the list (and you really should update in batches, see: &lt;A href="https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/"&gt;https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/&lt;/A&gt;) then the node that receives the packet sub-divides it into groups based on what shard they should be part of and forwards them to the leaders for that shard, very significantly increasing the numbers of conversations being carried on between Solr nodes. Times the number of threads you're specifying with CUSC (I really regret the renaming from ConcurrentUpdateSolrServer, I liked writing CUSS).&lt;/P&gt;&lt;P&gt;With CloudSolrClient, you can scale nearly linearly with the number of shards. Not so with CUSC.&lt;/P&gt;&lt;P&gt;FWIW,&lt;/P&gt;&lt;P&gt;Erick&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 02:07:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102254#M65214</guid>
      <dc:creator>sdutta</dc:creator>
      <dc:date>2016-01-14T02:07:45Z</dc:date>
    </item>
    <item>
      <title>Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102255#M65215</link>
      <description>&lt;P&gt;Throwing my 2 cents in since I've spent an insane amount of time working with Solr on this exact problem.&lt;/P&gt;&lt;P&gt;ConcurrentUpdateSolrClient is really easy to get going, and you can get high throughput just by increasing the number of threads. However, at some point it just won't be scalable or efficient once you have a bunch of Solr nodes.&lt;/P&gt;&lt;P&gt;If you are using SolrCloud, then CloudSolrClient is definitely the recommended way to go but, in my experience, it is much, much harder to get high throughput. Batching documents is pretty much a requirement. You can't really just increase the number of threads because each one opens a connection to ZooKeeper.&lt;/P&gt;&lt;P&gt;If you decide to go with CloudSolrClient, take a look at the code in &lt;A href="https://github.com/LucidWorks/storm-solr/tree/master/src/main/java/com/lucidworks/storm/solr"&gt;storm-solr&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 03:28:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ConcurrentUpdateSolrClient-vs-CloudSolrClient-for-bulk/m-p/102255#M65215</guid>
      <dc:creator>christopher_w_m</dc:creator>
      <dc:date>2016-01-14T03:28:52Z</dc:date>
    </item>
  </channel>
</rss>

