Member since: 04-22-2014
Posts: 1218
Kudos Received: 341
Solutions: 157
My Accepted Solutions
Views | Posted |
---|---|
22375 | 03-03-2020 08:12 AM |
12929 | 02-28-2020 10:43 AM |
3722 | 12-16-2019 12:59 PM |
3297 | 11-12-2019 03:28 PM |
5100 | 11-01-2019 09:01 AM |
02-28-2020
10:43 AM
Hi @Dombai_Gabor, One possible cause of this issue is that the volume is mounted with "noexec". Since your permissions and group membership seem correct, it is reasonable to check /etc/fstab to see whether "noexec" is set on the file system where /var is mounted. Ben
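The check suggested above could be sketched roughly as follows. This is a minimal sketch, not a definitive test: check_noexec is a hypothetical helper name, and it assumes the standard fstab column layout (device, mount point, type, options). It only reads the file it is given, so it reflects what is configured rather than what is currently mounted.

```shell
# Print "noexec" if the given mount point is listed with the noexec option
# in an fstab-format file, otherwise print "exec-allowed".
# check_noexec is a hypothetical helper name, not a standard tool.
check_noexec() {
  fstab_file="$1"
  mount_point="$2"
  awk -v mp="$mount_point" '
    # Skip comment lines; field 2 is the mount point, field 4 the options.
    $1 !~ /^#/ && $2 == mp {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == "noexec") found = 1
    }
    END { print (found ? "noexec" : "exec-allowed") }
  ' "$fstab_file"
}
```

For example, "check_noexec /etc/fstab /var" would report on the /var entry. If /var has no entry of its own, it lives on a parent file system (often /), so that entry would need checking instead; the output of "mount" shows the options that are actually in effect.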
02-28-2020
09:40 AM
Hello @HadoopBD, It appears to me that the log you provided may not have been captured with the steps I suggested before EricL's comment. Can you confirm how you retrieved these logs? From what I see, they only hint that there are problems launching containers; to see why, we will need to capture more information. Thanks, Ben
02-27-2020
02:55 PM
1 Kudo
Hi @HadoopBD,

The logs provided don't contain the environment or any clues that would help us understand what may have been passed to the command that is attempting to launch a container. It might have been missed in my previous message, but a good way of getting more detail about the Application Masters, containers, etc. is to collect logs via the "yarn logs" command. For instance:

yarn logs -applicationId application_1582677468069_0009 > application_1582677468069_0009.log

Resource Manager logs tell us some things, but not the whole picture. If you can run the above, the output may be pretty big, but searching it for the string "NOT" might be a start.

If you are on Cloudera Manager 6.3 or higher, you can try the following to collect more information about the container launch:

(1) Via Cloudera Manager, set the following configuration to 600 (10 minutes): Localized Dir Deletion Delay. This tells the Node Manager to wait 10 minutes before cleaning up the container launcher files, so we can review the files used in the failed container launch.
(2) Set the following YARN configuration: Enable Container Launch Debug Information. Check the box to enable it. This will allow you to collect extra container launch information in the "yarn logs -applicationId" output.
(3) SAVE your changes and then restart the YARN service from CM.
(4) Run a test MapReduce job (pi, for instance).
(5) After it fails, run the following to collect the aggregated logs for the job:

yarn logs -applicationId <app_id>

NOTE: you can direct the output to a file so you can search in the file.
(6) Look for "launch_container" in the output to find the launch information.

Again, the output might be pretty big, so you can try adding it here or look for things that may be relevant.
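Once the "yarn logs" output has been saved to a file, the search in step (6) and the earlier search for "NOT" could be sketched like this. It is only a sketch under assumptions: search_app_log is a hypothetical helper name, and it relies on the -w (whole word) option that GNU and BSD grep provide.

```shell
# Print line-numbered matches for the container launch information and for
# the stray "NOT" token in a saved "yarn logs" output file.
# search_app_log is a hypothetical helper name.
search_app_log() {
  log_file="$1"
  # -n shows line numbers so the surrounding context is easy to find.
  grep -n 'launch_container' "$log_file"
  # -w matches NOT only as a whole word, so "NOTE" or "NOTHING" are skipped.
  grep -nw 'NOT' "$log_file"
  return 0
}
```

For example, "search_app_log application_1582677468069_0009.log" would list the launch_container matches first, followed by the lines that contain the bare word NOT.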
02-26-2020
02:43 PM
Hello @HadoopBD,

Sorry to hear this has been causing you trouble. I'll see if I can help with the investigation. A few things you mention are relevant:

- All examples fail, which supports the assumption that something the jobs have in common is contributing to this issue.
- It also appears that the issue happens during container launch (as prelaunch.err contains the error).
- The failure indicates that somehow the class name was erroneously evaluated to the string "NOT", as seen here:

Error: Could not find or load main class NOT

Since there is no class named "NOT", that implies whatever evaluation was done to build the command that executes the class was incorrect. For instance, you can get the same result if you run the following:

> java NOT
Error: Could not find or load main class NOT

So, the question is what happened before the attempt was made to launch a container. One thing that can be an influence is the environment (env variables). If this is an out-of-the-box installation, we would not expect this to happen, so if you have updated any YARN or HDFS configuration, it would be good to note that.

We could use a bit more information, so I would suggest getting the logs for the application like this:

# yarn logs -applicationId application_1582677468069_0009 > application_1582677468069_0009.log

This should allow you to look at all the logs for this application, including any information that may have been missing from the job stdout.

Also consider running the job from the same host as the Resource Manager to see if the failure is the same. If the problem is related to your client environment or Hadoop configuration, that test may highlight it.

Cheers, Ben
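Since the environment is one suspect, a quick way to review it on the client host could look like the sketch below. The helper name hadoop_related_env and the list of name prefixes are assumptions made for illustration; they are not an official list of variables that YARN reads.

```shell
# Read NAME=value lines (for example from "env") on stdin and keep only the
# variables whose names suggest Hadoop, YARN, or Java settings.
# The name prefixes below are an assumption about which variables matter.
hadoop_related_env() {
  grep -E '^(HADOOP|YARN|MAPRED|HDFS|JAVA)[A-Za-z0-9_]*=|^CLASSPATH='
}
```

Running "env | hadoop_related_env" on the host that submits the job, and again on the Resource Manager host, gives two short lists that can be compared for unexpected values.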
02-26-2020
10:43 AM
1 Kudo
Hi @alcarin_ducil,

To answer one question: Cloudera applications that are written in Java use a Java keystore in the JKS format; Cloudera Manager is a Java application, so it uses a JKS file to determine trust in TLS handshakes.

Based on the error snippet you supplied, it appears that the failure occurred when the destination cluster's Cloudera Manager instance attempted to connect to the source cluster's Cloudera Manager. When it did so, the TLS handshake failed and the following was presented:

javax.net.ssl.SSLHandshakeException: SSLHandshakeException invoking https://destCMhostname:7183/api/v1/users: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

In this particular case, the BDR destination (target) cluster's Cloudera Manager server must trust the signer of the certificate presented by the source Cloudera Manager; as the error indicates, this trust could not be found.

One possible cause of this situation is that the file Cloudera Manager uses as its store of trusted certificate signers does not contain the signer. Cloudera Manager uses Administration --> Settings --> Security --> Cloudera Manager TLS/SSL Client Trust Store File if it is configured. If it is not configured, it uses the JDK's jssecacerts file; if that does not exist, the default cacerts file is used. See the following for more background:

https://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html#X509TrustManager

I would recommend checking (using keytool -list -v -keystore /path/to/keystore/file) to make sure you have the signing certificate for the source cluster in the JKS files. The key is understanding the chain of trust.

Are you using a self-signed certificate (where the signer is the same as the subject)? In that case, yes, you should be able to use the same certificates for the server and clients. If you used another certificate authority to sign your server certificates, however, you will need to add that signer's public certificate to the truststores.

Let's start there and see if that helps. If not, please let us know.

Regards, Ben
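The trust store lookup order described above could be sketched as a small helper that prints which file would be consulted. This is an illustration only: resolve_truststore is a hypothetical name, and the location of the JDK's security directory (for example $JAVA_HOME/jre/lib/security on a JDK 8 install) is an assumption that varies by JDK version and layout.

```shell
# Print the trust store that would be used, following the order described
# above: 1) the configured client trust store, 2) jssecacerts, 3) cacerts.
# resolve_truststore is a hypothetical helper; security_dir is the JDK's
# lib/security directory, whose location varies by JDK version (assumption).
resolve_truststore() {
  configured="$1"
  security_dir="$2"
  if [ -n "$configured" ]; then
    echo "$configured"
  elif [ -f "$security_dir/jssecacerts" ]; then
    echo "$security_dir/jssecacerts"
  else
    echo "$security_dir/cacerts"
  fi
}
```

The result can then be passed to the keytool command mentioned above, for example: keytool -list -v -keystore "$(resolve_truststore "" "$JAVA_HOME/jre/lib/security")". The first argument is left empty here to represent an unset Client Trust Store File setting.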
02-07-2020
10:36 AM
Hi @HKG, It seems you may have deleted the wrong file. In my instructions I suggested trying to remove /etc/cloudera-scm-server/db.mgmt.properties. However, you deleted db.properties. "db.properties" contains configuration information for Cloudera Manager's connection to the backing SQL database, so CM won't start without it. db.mgmt.properties does not influence Hue connections, since it does not hold the Hue DB configuration. Please restore the db.properties file so CM can start. After that, open a new thread with the "Hue" and "Cloudera Manager" labels, include a screenshot of the problem as you observe it, and we should be able to help out. Cheers, Ben
12-16-2019
01:01 PM
1 Kudo
@VijayM, It seems when I posted, my smiley after "that was quite a gap in our conversation" disappeared. I wanted to be sure you knew it was supposed to be there 🙂
12-16-2019
12:59 PM
@VijayM, That was quite a gap in our conversation 🙂

You are almost perfectly correct in your interpretation of the options:

External Only (with emergency Administrator access) means that FULL ADMINISTRATORS and USER ADMINISTRATORS can authenticate using the CM database.
External Only (without emergency Administrator access) means that no user can authenticate using the CM database.

"Emergency Access" means exactly what it says. If your LDAP database went down or something like that, you would still have a way to authenticate to CM to manage the configuration or user accounts. Any users who are not given the "full" or "user" administrator role will not have this emergency access to the CM UI.

The description next to the "Authentication Backend Order" configuration option explains it: The order in which authentication back ends are used for authenticating a user. Emergency Administrator Access allows Full and User Administrators in the local database to authenticate if external authentication is not functioning.

Regards, Ben
11-25-2019
09:03 AM
Hi @AstroPratik,

First, in order for us to provide the best help, we need to make sure we have information about the issue you are observing. My guess is you are seeing the same health alert in Cloudera Manager, but we also need to confirm you are seeing the same messages in the agent log. If so, please follow the instructions to provide a thread dump via the SIGQUIT signal.

The instructions I provided for the "kill -SIGQUIT" command only work in Cloudera Manager 5.x. If you are using CM 6, you can use the following:

kill -SIGQUIT $(systemctl show -p MainPID cloudera-scm-agent.service 2>/dev/null | cut -d= -f2)

If you do run the kill -SIGQUIT command, make sure to run it a couple of times so we can compare snapshots, AND make sure you capture the thread dump while the problem is occurring.

NOTE: After reviewing the previous party's thread dump, it appears that a thread that is spawned to collect information for a diagnostic bundle is slow in processing; the thread that uploads service and host information to the Host and Service Monitor servers also seems to be slow. Since obtaining a diagnostic bundle is something that does not happen often, it is likely that the bundle creation is triggering the old event. There are a number of possible causes for "firehose" trouble, though, so it is important that we understand the facts about your situation before making any judgements.
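Taking several thread dumps in a row, as suggested above, could be scripted along these lines. This is a sketch under assumptions: the helper names, the default of 3 dumps, and the 10 second pause are illustrative choices, and KILL_CMD is a hypothetical override that exists only so the loop can be dry-run without signalling anything.

```shell
# Extract the PID from "MainPID=<pid>" text, as printed by
# "systemctl show -p MainPID <unit>".
main_pid_from_show() {
  cut -d= -f2
}

# Send SIGQUIT to a PID several times with a pause in between, so that
# successive thread dumps can be compared.
# KILL_CMD is a hypothetical override used only for dry runs (e.g. "echo").
send_thread_dumps() {
  pid="$1"
  count="${2:-3}"      # number of dumps (assumed default)
  interval="${3:-10}"  # seconds between dumps (assumed default)
  i=1
  while [ "$i" -le "$count" ]; do
    ${KILL_CMD:-kill} -s QUIT "$pid" || return 1
    # Pause between dumps, but not after the last one.
    [ "$i" -lt "$count" ] && sleep "$interval"
    i=$((i + 1))
  done
}
```

On a CM 6 host this could be used as: send_thread_dumps "$(systemctl show -p MainPID cloudera-scm-agent.service | main_pid_from_show)" 3 10. Setting KILL_CMD=echo first prints the kill arguments instead of sending the signal.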
11-20-2019
08:41 AM
2 Kudos
Hi @wert_1311, It has been a long time since I looked at sizing, so I forget the specifics. The following rule has helped in the past, though: (4 x fsimage_size) + 3GB. So, if your fsimage is 980 MB, the rule gives roughly 4 x 1 GB + 3 GB, and 7 or 8GB should be appropriate as a starting point. The Reports Manager downloads the fsimage from the NameNode and then parses/indexes it, so it needs enough heap to fit the fsimage in memory plus overhead for the indexing process. Based on what you have collected, increasing the Reports Manager heap sounds like the right call. Just make sure you have enough free memory on the host before increasing the heap. Ben
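The rule of thumb above can be worked through with a few lines of shell arithmetic. This sketch assumes the 980 figure is in MB and treats 1 GB as 1024 MB; reports_manager_heap_gb is a hypothetical helper name, and the result is only a starting point, as noted above.

```shell
# Suggested Reports Manager heap in whole GB, rounded up, from the rule of
# thumb (4 x fsimage_size) + 3 GB. Input is the fsimage size in MB.
# reports_manager_heap_gb is a hypothetical helper name.
reports_manager_heap_gb() {
  fsimage_mb="$1"
  # Work in MB, then add 1023 before dividing so the result rounds up.
  echo $(( (4 * fsimage_mb + 3 * 1024 + 1023) / 1024 ))
}
```

With an input of 980 this prints 7, which matches the lower end of the 7 or 8GB suggestion.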