
Snap pull failed error


Explorer

 

Hello,

 

We are running Cloudera Search on 3 servers with 128GB RAM each. I have the following MaxDirectMemorySize and slab count settings:

 

-XX:MaxDirectMemorySize=40g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=80

 

However, when we experience heavy writes, we see some replica servers failing with this error. The collection itself is only 12GB. Can you please help clarify what is happening here?

 

SnapPull failed :java.lang.RuntimeException: The max direct memory is likely too low.  Either increase it (by adding -XX:MaxDirectMemorySize=<size>g -XX:+UseLargePages to your containers startup args) or disable direct allocation using solr.hdfs.blockcache.direct.memory.allocation=false in solrconfig.xml. If you are putting the block cache on the heap, your java heap size might not be large enough. Failed allocating
9 REPLIES

Re: Snap pull failed error

Cloudera Employee
By default, each slab is 128MB. That's 10GB per SolrCore. How much RAM do you actually have and are you oversharding?
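(With the settings posted above, that's 80 slabs x 128 MB = 10,240 MB, roughly 10 GB of direct memory for each SolrCore's block cache.)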

Based on that error, it almost seems like replication is doing a SolrCore reload - if that was the case, there could briefly be two SolrCores around, each with a 10GB block cache. Eventually we will have an option for one block cache memory pool for multiple SolrCores.

A reload would be strange though - it should only happen if you are replicating config files, and that doesn't make sense in SolrCloud mode.

What does your replication handler definition in solrconfig.xml look like?

Re: Snap pull failed error

Explorer

Each server (3 in total) has 128GB RAM, with only Cloudera Search running. There are 3 collections and each collection is about 10GB in size. There are no shards (1 shard, 3 replica configuration). Java max heap (Xmx) is 40GB.

 

I see this error only when there are continuous writes.

 

This is just a SolrCloud configuration. I don't have any separate replication configuration. The replication request handler looks like this:

 

<requestHandler name="/replication" class="solr.ReplicationHandler" >
<!--
To enable simple master/slave replication, uncomment one of the
sections below, depending on whether this solr instance should be
the "master" or a "slave". If this instance is a "slave" you will
also need to fill in the masterUrl to point to a real machine.
-->
<!--
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="replicateAfter">startup</str>
<str name="confFiles">schema.xml,stopwords.txt</str>
</lst>
-->
<!--
<lst name="slave">
<str name="masterUrl">http://your-master-hostname:8983/solr</str>
<str name="pollInterval">00:00:60</str>
</lst>
-->
</requestHandler>

Re: Snap pull failed error

Explorer

This is the full exception from the logs:

 

9:33:24.413 PM ERROR org.apache.solr.handler.ReplicationHandler
SnapPull failed :java.lang.RuntimeException: The max direct memory is likely too low.  Either increase it (by adding -XX:MaxDirectMemorySize=<size>g -XX:+UseLargePages to your containers startup args) or disable direct allocation using solr.hdfs.blockcache.direct.memory.allocation=false in solrconfig.xml. If you are putting the block cache on the heap, your java heap size might not be large enough. Failed allocating ~10737.41824 MB.
at org.apache.solr.core.HdfsDirectoryFactory.create(HdfsDirectoryFactory.java:143)
at org.apache.solr.core.CachingDirectoryFactory.get(CachingDirectoryFactory.java:350)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:391)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:322)
at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:148)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:405)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:632)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:97)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
at org.apache.solr.store.blockcache.BlockCache.<init>(BlockCache.java:61)
at org.apache.solr.core.HdfsDirectoryFactory.create(HdfsDirectoryFactory.java:140)
... 6 more

9:33:24.413 PM ERROR org.apache.solr.cloud.RecoveryStrategy
Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:151)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:405)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)

Re: Snap pull failed error

Cloudera Employee

Thanks for the stack trace - that clears things up. I was thinking there was a reload because a new HdfsDirectory was being created - but in fact there is no reload, and a reload would not necessarily cause that anyway.

 

The issue is that on some replications, Solr cannot use the same index directory it was using - it has to roll over to a new one. This can add another 10GB block cache, as two HdfsDirectory instances can briefly be around together.

 

This brief overlap is probably what's making you go over your max direct memory setting.
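To put rough numbers on it (assuming each of the three collections has a replica core on this node, as described above, each with its own 80-slab cache): three caches at ~10.7 GB each is already around 32 GB of direct memory, and one extra HdfsDirectory during replication adds another ~10.7 GB, which would put you right at or over the 40 GB MaxDirectMemorySize.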

 

I'd use smaller block caches or raise that limit if you think you have the available RAM.
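For example (the numbers here are only illustrative - pick values that fit what's left after the 40GB heap and OS overhead), either of these startup-arg changes would give you headroom:

-XX:MaxDirectMemorySize=60g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=80

or

-XX:MaxDirectMemorySize=40g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=40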

 

Once we have an option for a single memory pool for all the block caches, this will be a little nicer.

Re: Snap pull failed error

Cloudera Employee

I've filed a Solr issue for the global block cache improvement: 

 

You should be able to use one pool of memory for multiple collections' HDFS block caches.

 

 

We will pull that into Cloudera Search in a future release. Until then, you probably either have to raise the direct memory limit or lower the block cache size.
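If you'd rather set this in solrconfig.xml than via -D system properties in the startup args, the block cache knobs live on the HdfsDirectoryFactory. A sketch (parameter names as in the Solr HDFS documentation; the values, including the HDFS path, are only illustrative):

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://your-namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <!-- each slab is 128MB by default, so 40 slabs is ~5GB per SolrCore -->
  <int name="solr.hdfs.blockcache.slab.count">40</int>
  <!-- set to false to put the block cache on the heap instead of direct memory -->
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
</directoryFactory>

As the error message notes, setting solr.hdfs.blockcache.direct.memory.allocation to false moves the cache onto the Java heap, in which case the heap (Xmx) has to be sized to hold it.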


Re: Snap pull failed error

Explorer

Thank you for logging the issue. To reduce the block cache size, I should just reduce this setting (solr.hdfs.blockcache.slab.count), right?

Re: Snap pull failed error

Cloudera Employee

Right, you can just reduce it. Each slab is 128MB with the other defaults.
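For instance, dropping solr.hdfs.blockcache.slab.count from 80 to 40 would take each SolrCore's cache from ~10 GB down to 40 x 128 MB = 5,120 MB, about 5 GB.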

Re: Snap pull failed error

Explorer

Thanks!

Re: Snap pull failed error

New Contributor

Has this problem been resolved?

 

I found a fix for this JIRA in the CDH 5.9.1 release notes.

 

We use CDH 5.9.1, but we still have the same issue.

 

I am confused.

 
