- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Solr architecture for a production environment
- Labels:
-
Apache Solr
Created 12-10-2015 09:38 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We need to deploy Solr 5.2.1 on HDP 2.3.2 on a production environment (3 master nodes with HA on HDFS, YARN and Hive, 13 worker nodes, 2 edge, 2 support and 2 security). Is there a "best practice" for production? This is a multi-purpose cluster in which Hive, Pig, HOYA and Spark jobs are currently running.
Created 12-11-2015 04:25 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For high throughput use cases, Solr (actually Solr in Cloud mode) should run on separate nodes. However for HDFS based indexes you may get slight performance degradation. You can colocate Solr with the Datanodes but you sacrifice latency. So since you are running Spark jobs also, I would recommend SolrCloud on a couple more nodes
Created 12-11-2015 04:25 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For high throughput use cases, Solr (actually Solr in Cloud mode) should run on separate nodes. However for HDFS based indexes you may get slight performance degradation. You can colocate Solr with the Datanodes but you sacrifice latency. So since you are running Spark jobs also, I would recommend SolrCloud on a couple more nodes
Created 12-11-2015 01:31 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
+1 to @Ancil McBarnett . I would add depending on how you will be accessing Solr, you may want a load balancer in front of your cloud. Any of the Solr instances, shard or replica, can service requests on the SolrCloud.
Created 12-11-2015 08:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Ancil McBarnett Thanks! We need to keep indexes on HDFS but we need also to index files (about 500.000) on HDFS (PDF, EML and P7F). Following your suggestion could we deploy Solr on all DataNodes and also on two master nodes?
@azeltov So is it correct to say that any Solr could service request on HTTP port 8983 (both Solr and Banana)? Do you have some suggestion about the load balancer? Thanks a lot!
Created 12-30-2015 07:46 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Andrea D'Orio You can point an F5 to all or any of the SOLR nodes. SOLR cloud is smart enough in distributing queries to the right shards and replicas. Round robin should be fine. Also, if you're using HDFS to store the indexes than the SOLR needs to sit on the data nodes or nodes with the HDFS client.
https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Install.html
Created 01-05-2016 01:57 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you are using SolrJ from your client, then it will connect to zookeeper and automatically do the load balancing for you. If you are going to use SolrJ, then make sure use CloudSolrClient class