Created 08-04-2020 02:50 AM
Hello,
I have a production cluster where the Navigator Metadata Server is hitting out-of-memory errors. Each failure generates a large dump file (33 GB last time) in /tmp. Since /tmp has only 50 GB, it fills up quite fast.
Errors:
The health test result for NAVIGATORMETASERVER_UNEXPECTED_EXITS has become bad: This role encountered 1 unexpected exit(s) in the previous 5 minute(s).This included 1 exit(s) due to OutOfMemory errors. Critical threshold: any.
The health test result for NAVIGATORMETASERVER_DATA_DIRECTORY_FREE_SPACE has become unknown: Not enough data to test: Test of whether the Navigator Metadata Server Storage Dir has enough free space.
There is a formula to estimate the optimal Java heap size for the Navigator Metadata Server, but in my case the cloudera-scm-navigator log file contains no entries for nav_elements or nav_relations. Is there any other way to estimate the optimal Java heap size? The current value is 24 GB.
Formula: ((num_nav_elements + num_nav_relations) * 200 bytes) + 2 GB
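For reference, the formula can be written out as a small Python sketch. The function names are hypothetical, and reading "2 GB" as 2 GiB (2 * 1024^3 bytes) is an assumption, since the formula does not say which unit is meant. The second function inverts the formula to show how many documents a given heap would cover.

```python
GIB = 1024 ** 3  # assumption: "2 GB" in the formula is read as 2 GiB


def estimate_heap_bytes(num_nav_elements, num_nav_relations,
                        bytes_per_doc=200, base_bytes=2 * GIB):
    """Apply ((num_nav_elements + num_nav_relations) * 200 bytes) + 2 GB."""
    return (num_nav_elements + num_nav_relations) * bytes_per_doc + base_bytes


def max_documents_for_heap(heap_bytes, bytes_per_doc=200, base_bytes=2 * GIB):
    """Invert the formula: how many documents does a given heap cover?"""
    return max(0, (heap_bytes - base_bytes) // bytes_per_doc)


if __name__ == "__main__":
    # Heap currently configured in this thread: 24 GB.
    docs = max_documents_for_heap(24 * GIB)
    print(f"A 24 GiB heap covers about {docs:,} documents by this formula")
```

By this reading, a 24 GB heap would correspond to roughly 118 million elements plus relations, so the actual document counts are still needed to tell whether 24 GB is too small.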
Link from Cloudera Docs
Thank you,
Created 08-04-2020 06:47 AM
Hello @md186036 ,
thank you for your questions on
Thank you:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Created 08-04-2020 06:58 AM
Hello @Bender ,
Thanks a lot for your fast answer.
The CDH version I am using is 5.15.2.
I checked the Development cluster and there I can find the number of elements in the log file but in the Production cluster, it's not possible.
Thank you,
Daniel
Created 08-04-2020 07:09 AM
Hello @md186036 ,
Do you see INFO-level messages in the prod cluster log? I suspect that your log level (threshold) is set to WARN or ERROR, for example, and that this may be why you do not see the "nav_" entries. The log level can also be set via CM's service configuration.
Thank you:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Created 08-04-2020 07:35 AM
Created 08-04-2020 07:55 AM
Hello @md186036 ,
I have checked the Navigator logs in a test cluster for CDH5.16 (for our purposes this minor version difference should not matter much) and found the entries mentioned in this doc for CDH5.15:
2020-07-29 17:41:18,151 INFO com.cloudera.nav.server.NavServerUtil [main]: Found 885 documents in solr core nav_elements
2020-07-29 17:41:18,155 INFO com.cloudera.nav.server.NavServerUtil [main]: Found 916 documents in solr core nav_relations
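As a rough sketch, counts like the two above can be pulled out of a downloaded role log with a short Python script and fed into the heap formula from the original question. The function name is hypothetical, the regular expression is based only on the two sample lines shown here, and treating "2 GB" as 2 GiB is an assumption.

```python
import re

# Pattern derived from the two sample log lines above (an assumption about
# the general log format, not a documented contract).
SOLR_COUNT_RE = re.compile(
    r"Found (\d+) documents in solr core (nav_elements|nav_relations)"
)


def count_solr_documents(lines):
    """Return the most recent document count seen for each Solr core."""
    counts = {}
    for line in lines:
        match = SOLR_COUNT_RE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


if __name__ == "__main__":
    sample = [
        "2020-07-29 17:41:18,151 INFO com.cloudera.nav.server.NavServerUtil "
        "[main]: Found 885 documents in solr core nav_elements",
        "2020-07-29 17:41:18,155 INFO com.cloudera.nav.server.NavServerUtil "
        "[main]: Found 916 documents in solr core nav_relations",
    ]
    counts = count_solr_documents(sample)
    total = counts.get("nav_elements", 0) + counts.get("nav_relations", 0)
    heap_bytes = total * 200 + 2 * 1024 ** 3  # formula from the question
    print(counts, heap_bytes)
```

To run it against a real log, pass an open file object instead of the sample list. An empty result means the log contains no such entries, as in the production cluster described here.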
I navigated to CM -> Cloudera Management Service -> Navigator Metadata Server and then clicked Log Files -> Role log file.
I downloaded the file by clicking "Download Full Log".
The downloaded log contains INFO-level entries.
Do you see INFO-level entries in your log too?
I would like to rule out that the log level was not applied after your config change (e.g. the role was not restarted) or that the log level was changed by other means (e.g. without a restart).
Thank you:
Ferenc
Ferenc Erdelyi, Technical Solutions Manager
Created 08-04-2020 07:58 AM
Hi @Bender ,
Thanks a lot for your research. I will check it and get back to you later today.
Best regards,
Daniel
Created 08-04-2020 12:33 PM
Hi @Bender ,
I do have entries with INFO in the log file but I can't find any with nav_elements. I also have some errors and warnings:
ERROR SparkPushExtractor
[qtp1810923540-17908]: com.cloudera.nav.pushextractor.spark.SparkPushExtractor Error extracting Spark operation.
java.lang.NullPointerException
WARN ApiExceptionMapper
[qtp1810923540-17908]: Unexpected exception.
java.lang.NullPointerException
ERROR SparkPushExtractor
[qtp1810923540-17879]: com.cloudera.nav.pushextractor.spark.SparkPushExtractor Error extracting Spark operation.
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got application/xml. <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="error"><str name="msg">application/x-www-form-urlencoded content length (3820497 bytes) exceeds upload limit of 2048 KB</str><int name="code">400</int></lst>
</response>
I restarted Cloudera Management Server on July 29th.
Thanks,
Daniel
Created 08-05-2020 05:38 AM
Hello @md186036 ,
The error message you pointed out [1] appears to be a known issue, tracked by the following internal JIRA ticket:
NAV-7272 - NPE in getEpIdentitiesForMissingRelations
As per the JIRA ticket:
"An NPE is being caused by getEpIdentitiesForMissingRelations() during Spark extraction. The condition that causes it is rare, however, once the condition exists, because of the NPE, it will continue forever.
The code is trying to detect ep2Ids for linked relations that are missing so they can be added. However, the code fails to check for null in the case that this is true."
The fix is not available yet in any currently released CDH distribution. The fix might be available in CDH6.4.0, 5.16.3, 6.2.2, 6.3.4, 7.1.1, 5.17.0.
My understanding is that this bug can prevent new metadata from being produced. If you have a Cloudera Support Subscription, please file a support ticket with us so we can assist you further, as no workaround has been identified for this bug.
Thank you:
Ferenc
[1]
ERROR SparkPushExtractor [qtp1810923540-17908]: com.cloudera.nav.pushextractor.spark.SparkPushExtractor Error extracting Spark operation. java.lang.NullPointerException
Ferenc Erdelyi, Technical Solutions Manager
Created 08-05-2020 05:44 AM
Hi @Bender ,
Thanks a lot for your answers.
I will open a case in Cloudera Support for it.
Kind regards,
Daniel