We are running Cloudera Search on 3 servers with 128GB RAM each. I have the following MaxDirectMemory and slab count settings:
-XX:MaxDirectMemorySize=40g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=80
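For reference, if I understand the defaults correctly (solr.hdfs.blockcache.blocksperbank=16384 and 8KB blocks), each slab works out to:

16384 blocks x 8192 bytes = 134,217,728 bytes (~128 MiB) per slab
80 slabs x ~128 MiB = ~10.7GB of direct memory for each core's block cache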
However, when we experience heavy writes, we see some replica servers failing with this error. The collection itself is only 12GB. Can you please help clarify what is happening here?
SnapPull failed :java.lang.RuntimeException: The max direct memory is likely too low. Either increase it (by adding -XX:MaxDirectMemorySize=<size>g -XX:+UseLargePages to your containers startup args) or disable direct allocation using solr.hdfs.blockcache.direct.memory.allocation=false in solrconfig.xml. If you are putting the block cache on the heap, your java heap size might not be large enough. Failed allocating
Each server (3 in total) has 128GB RAM with only Cloudera Search running. There are 3 collections on each server and each collection is about 10GB in size. The collections are not sharded (1 shard, 3 replica configuration). Java max heap (Xmx) is 40GB.
I see this error only when there are continuous writes happening.
This is just a SolrCloud configuration. I don't have any separate replication configuration. The replication request handler looks like this:
<requestHandler name="/replication" class="solr.ReplicationHandler" >
  <!--
  To enable simple master/slave replication, uncomment one of the
  sections below, depending on whether this solr instance should be
  the "master" or a "slave". If this instance is a "slave" you will
  also need to fill in the masterUrl to point to a real machine.
  -->
</requestHandler>
This is the full exception from the logs:
SnapPull failed :java.lang.RuntimeException: The max direct memory is likely too low. Either increase it (by adding -XX:MaxDirectMemorySize=<size>g -XX:+UseLargePages to your containers startup args) or disable direct allocation using solr.hdfs.blockcache.direct.memory.allocation=false in solrconfig.xml. If you are putting the block cache on the heap, your java heap size might not be large enough. Failed allocating ~10737.41824 MB.
Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
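The ~10737.41824 MB it failed to allocate is, if my slab math above is right, exactly 80 slabs x 134,217,728 bytes, i.e. one whole block cache at the slab count we have configured. The error also mentions disabling direct allocation with solr.hdfs.blockcache.direct.memory.allocation=false; if I am reading solrconfig.xml correctly, that would go in the <directoryFactory> block, something like this (just a sketch, other parameters omitted):

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
  ...
</directoryFactory>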
Thanks for the stack trace - that clears things up. I was thinking there was a reload, because a new HdfsDirectory was being created - but in fact there is no reload, and a reload would not necessarily cause that anyway.
The issue is that on some replications, Solr cannot reuse the index directory it was using - it has to roll over to a new one. This can add another ~10GB block cache, as two HdfsDirectory instances can briefly exist at the same time.
This brief overlap is probably what's pushing you over your max direct memory setting.
I'd use smaller block caches or raise that limit if you think you have the available RAM.
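Roughly, assuming each of your 3 cores allocates its own 80-slab cache:

3 cores x ~10.7GB = ~32GB of direct memory in steady state
~32GB + another ~10.7GB during a replication rollover = ~43GB, which is over your 40GB MaxDirectMemorySize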
Once we have an option for a single memory pool for all the block caches, this will be a little nicer.
I've filed a Solr issue for the global block cache improvement:
You should be able to use one pool of memory for multiple collections' HDFS block caches.
We will pull that into Cloudera Search in a future release. Until then, you probably either have to raise the direct memory limit or lower the block cache size.
Thank you for logging the issue. In order to reduce the block cache size, I should just reduce this setting (solr.hdfs.blockcache.slab.count), right?
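For example (values just illustrative, to check my math), something like -Dsolr.hdfs.blockcache.slab.count=40 should cut each core's cache to about 40 x ~128 MiB = ~5.4GB?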