Solr architecture for a production environment

We need to deploy Solr 5.2.1 on HDP 2.3.2 in a production environment (3 master nodes with HA for HDFS, YARN and Hive; 13 worker nodes; 2 edge, 2 support and 2 security nodes). Is there a "best practice" for production? This is a multi-purpose cluster that currently runs Hive, Pig, HOYA and Spark jobs.

1 ACCEPTED SOLUTION

For high-throughput use cases, Solr (specifically Solr in Cloud mode, i.e. SolrCloud) should run on separate nodes. With HDFS-based indexes, however, you may see a slight performance degradation. You can co-locate Solr with the DataNodes, but you sacrifice latency. Since you are also running Spark jobs, I would recommend running SolrCloud on a couple of additional nodes.

5 REPLIES

+1 to @Ancil McBarnett. I would add that, depending on how you will be accessing Solr, you may want a load balancer in front of your SolrCloud cluster. Any of the Solr instances, whether hosting a shard or a replica, can service requests to the cluster.

@Ancil McBarnett Thanks! We need to keep the indexes on HDFS, but we also need to index about 500,000 files stored on HDFS (PDF, EML and P7F). Following your suggestion, could we deploy Solr on all DataNodes and also on two master nodes?

@azeltov So is it correct to say that any Solr node can service requests on HTTP port 8983 (for both Solr and Banana)? Do you have any suggestions for the load balancer? Thanks a lot!

@Andrea D'Orio You can point an F5 at all or any of the Solr nodes. SolrCloud is smart enough to distribute queries to the right shards and replicas, so round robin should be fine. Also, if you're using HDFS to store the indexes, then Solr needs to sit on the DataNodes or on nodes with the HDFS client.

https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Install.html
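
To make the round-robin suggestion concrete, here is a minimal Java sketch of what a balancer does across the Solr nodes: it hands out node base URLs in rotation and builds a standard `/select` query URL against whichever node comes next. The host names and the collection name `docs` are hypothetical, and this only illustrates the idea; it is not how an F5 is actually configured.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: rotates requests across Solr nodes.
final class SolrRoundRobin {
    private final List<String> baseUrls;
    private final AtomicInteger counter = new AtomicInteger(0);

    SolrRoundRobin(List<String> baseUrls) {
        if (baseUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one Solr node is required");
        }
        this.baseUrls = new ArrayList<>(baseUrls);
    }

    // Returns the next node in rotation; floorMod keeps the index valid
    // even if the counter eventually overflows.
    String nextBaseUrl() {
        int i = Math.floorMod(counter.getAndIncrement(), baseUrls.size());
        return baseUrls.get(i);
    }

    // Builds a query URL against the next node. Any node can receive it,
    // because SolrCloud routes the query to the right shards and replicas.
    String selectUrl(String collection, String query) {
        try {
            return nextBaseUrl() + "/" + collection + "/select?q="
                    + URLEncoder.encode(query, "UTF-8") + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical host names; 8983 is the port mentioned in the thread.
        SolrRoundRobin lb = new SolrRoundRobin(Arrays.asList(
                "http://solr1.example.com:8983/solr",
                "http://solr2.example.com:8983/solr",
                "http://solr3.example.com:8983/solr"));
        for (int i = 0; i < 4; i++) {
            System.out.println(lb.selectUrl("docs", "*:*"));
        }
    }
}
```

A real load balancer would also health-check the nodes and skip the ones that are down, which this sketch does not attempt.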

If you are using SolrJ from your client, it will connect to ZooKeeper and do the load balancing for you automatically. If you are going to use SolrJ, make sure you use the CloudSolrClient class.
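
One detail that tends to trip people up with CloudSolrClient is the ZooKeeper connection string it takes: a comma-separated list of `host:port` pairs with the chroot (if any) appended once at the end. Below is a small sketch of building that string with plain Java. The ZooKeeper host names, the `/solr` chroot and the `docs` collection are assumptions for illustration; check what your cluster actually uses. The SolrJ calls are shown only in comments so the snippet runs without the SolrJ jar.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: builds a ZooKeeper connect string.
final class ZkHostString {

    // Joins the ensemble members with commas and appends the chroot once.
    static String build(List<String> zkNodes, String chroot) {
        if (zkNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one ZooKeeper node is required");
        }
        String hosts = String.join(",", zkNodes);
        if (chroot == null || chroot.isEmpty()) {
            return hosts;
        }
        return hosts + (chroot.startsWith("/") ? chroot : "/" + chroot);
    }

    public static void main(String[] args) {
        // Hypothetical ZooKeeper ensemble and chroot.
        String zkHost = build(Arrays.asList(
                "zk1.example.com:2181",
                "zk2.example.com:2181",
                "zk3.example.com:2181"), "/solr");
        System.out.println(zkHost);

        // With SolrJ 5.x on the classpath, the string would be used roughly like this:
        //   CloudSolrClient client = new CloudSolrClient(zkHost);
        //   client.setDefaultCollection("docs");   // hypothetical collection
        //   QueryResponse rsp = client.query(new SolrQuery("*:*"));
        //   client.close();
    }
}
```

Because the client reads the cluster state from ZooKeeper, it does not need a separate load balancer in front of Solr for its own requests.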