Member since: 04-22-2014
Posts: 1218
Kudos Received: 342
Solutions: 157
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 28526 | 03-03-2020 08:12 AM |
| | 18968 | 02-28-2020 10:43 AM |
| | 5249 | 11-12-2019 03:28 PM |
| | 7639 | 11-01-2019 09:01 AM |
| | 7362 | 08-12-2019 11:06 AM |
03-24-2020
01:03 PM
Hi @WilsonLozano,

Since the ldapsearch command returned the object without issue, we can conclude that the bind user and password are correct. Thus, I believe the issue may involve referrals and how they are being followed. I find this odd, since ldapgroupsmapping should have referral following off by default. Nonetheless, we see this in your ldapsearch result:

ref: ldap://DomainDnsZones.sub.us.domain.local/DC=DomainDnsZones,DC=sub,DC=us,DC=domain,DC=local

So, I would suggest trying either of the following:
- Change your search base to something more specific, like "OU=Accounts,DC=sub,DC=us,DC=domain,DC=local", so that no referral is returned from Active Directory.
- Try using the Global Catalog (port 3268, non-TLS).

I am fairly confident that referrals are involved, but I don't know why hadoop-common would be following them. Another thing you could do is use tcpdump to capture packets on port 389 and then use Wireshark to decode them. That would show us exactly what the client is trying to do and what the server responds, in terms of the LDAP protocol (see the capture sketch below).
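A minimal capture sketch, assuming you can run tcpdump as root on the host doing the group lookups; the interface and output filename here are placeholders, not from your environment:

# Capture LDAP traffic (port 389) on all interfaces to a pcap file
tcpdump -i any -w ldap-groupmapping.pcap port 389

Then open ldap-groupmapping.pcap in Wireshark; the decoded LDAP messages should show whether the client issues follow-up searches against the referral target after the initial searchResRef.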
03-23-2020
01:36 PM
@WilsonLozano, I believe the error you are getting indicates that the bind user defined in hadoop.security.group.mapping.ldap.bind.user does not exist in the LDAP server, though I haven't searched online for confirmation. You could try using ldapsearch to test, with something like this:

ldapsearch -x -H ldap://sub.us.domain.local:389 -D "ClouderaManager@SUB.US.DOMAIN.LOCAL" -W -b "DC=sub,DC=us,DC=domain,DC=local" "(&(objectClass=user)(sAMAccountName=c12345a))"

If the above returns an error, you can enable debugging in ldapsearch to get a clearer picture of what failed by adding the "-d1" option to the command above (after -W, for instance), as sketched below.
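The same query with trace debugging enabled; everything here is taken from the command above, with only -d1 added:

# -d1 prints libldap trace output to stderr, showing where the bind or search fails
ldapsearch -x -H ldap://sub.us.domain.local:389 -D "ClouderaManager@SUB.US.DOMAIN.LOCAL" -W -d1 -b "DC=sub,DC=us,DC=domain,DC=local" "(&(objectClass=user)(sAMAccountName=c12345a))"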
03-03-2020
08:12 AM
1 Kudo
That's great! You should be able to replace the "NOT FOUND" values for those two fields with:

-Djava.net.preferIPv4Stack=true

That is the value CM usually sets by default. I'm not sure how "NOT FOUND" ended up there.
03-02-2020
11:33 AM
@HadoopBD, I was able to reproduce your symptoms, based on what I saw in the debug output from my successful run. Although I am sure there are a few ways this could happen, here is how I was able to get the same failure:

[2020-03-02 18:56:45.154]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class NOT

HOW I REPRODUCED THE ERROR:
(1) In Cloudera Manager, opened the YARN configuration.
(2) Searched for Map Task Java Opts Base.
(3) Updated Map Task Java Opts Base by adding a space and then the word "NOT". For example: -Djava.net.preferIPv4Stack=true NOT
(4) Saved the change, deployed the client configuration, and restarted YARN (probably not necessary).
(5) Ran: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 4 4
(6) The result was the error above.

When I captured the application logs with the debugging I mentioned enabled, I could see that launch_container.sh issued the following Java command:

exec /bin/bash -c "$JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true NOT -Xmx820m ...

Since the word "NOT" does not have an option flag in front of it, Java interprets it as the class that should be run. Based on the above, I would say that increasing the container log deletion delay and enabling debug as I described in previous posts will show us the problem.

Cheers,
Ben
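You can see the same parsing behavior with a plain java invocation, independent of YARN; a minimal illustration:

# java treats -D... as an option, then takes the first bare token as the main class
java -Djava.net.preferIPv4Stack=true NOT
Error: Could not find or load main class NOT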
03-02-2020
10:23 AM
Hello @HadoopBD,

Thanks for providing the logs, but they do not contain what we would expect if you had followed the steps to enable container launch debug information. I am guessing you missed my steps during the threaded conversation. Basically, the standard logs show some information, but not all: we are missing the actual files and log information about how the "launch_container" process was started and what was passed to the script used to execute the necessary Java command. Capturing that information will most likely give us some sort of clue about the cause of this issue. The configuration to retain container launching information and have the "yarn logs" command collect it was added in CM 6.3, so I wanted to find out if you had that version.

If you are on Cloudera Manager 6.3 or higher, you can try the following to collect more information about the container launch:
(1) Via Cloudera Manager, set the following configuration to 600 (10 minutes): Localized Dir Deletion Delay. This tells the Node Manager to wait 10 minutes before cleaning up the container launcher, which will help us review the files used in the failed container launch.
(2) Set the following YARN configuration: Enable Container Launch Debug Information (check the box to enable it). This allows you to collect extra container launch information in the "yarn logs -applicationId" output.
(3) SAVE your changes and then restart the YARN service from CM.
(4) Run a test MapReduce job (pi, for instance).
(5) After it fails, run the following to collect the aggregated logs for the job: yarn logs -applicationId <app_id> NOTE: you can redirect the output to a file so you can search in it.
(6) Look for "launch_container" in the output to find the launch information (see the sketch after this list).

I just ran through a test, and a lot more detail about how the command will be launched is available. I truly believe it will help us assess a cause so we can find a solution.
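A sketch of steps (5) and (6) together; the application ID here is the example one from earlier in the thread, so substitute the ID of your failed job:

# Collect the aggregated logs to a file, then search for the launch script output
yarn logs -applicationId application_1582677468069_0009 > application_1582677468069_0009.log
grep -n "launch_container" application_1582677468069_0009.log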
02-28-2020
04:38 PM
@Dombai_Gabor, I'm sorry to hear that... I think you mean that the OS won't boot; if so, let us know what happens and perhaps we can help. Offhand, I'm not too familiar with OS boot debugging tactics, but others might be able to provide some insight.
02-28-2020
10:43 AM
1 Kudo
Hi @Dombai_Gabor,

One possible cause of this issue is that the volume is mounted with "noexec". Since your permissions and group membership seem correct, it is reasonable to check /etc/fstab to see if "noexec" is set where /var is mounted (a quick check is sketched below).

Ben
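A quick way to check, assuming /var is its own mount point (adjust the path if it lives on a different filesystem):

# Show the mount options currently in effect for /var
mount | grep ' /var '
# And check what /etc/fstab specifies for it
grep '/var' /etc/fstab

If "noexec" appears in the options, scripts and binaries on that filesystem cannot be executed regardless of their permission bits.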
02-28-2020
09:40 AM
Hello @HadoopBD,

It appears to me that the log you provided may not have been captured with the steps I suggested before EricL's comment. Can you confirm how you retrieved these logs? From what I see, this only hints that there are problems launching containers; in order to see why, we will need to capture more information.

Thanks,
Ben
02-27-2020
02:55 PM
1 Kudo
Hi @HadoopBD,

The logs provided don't contain the environment or any clues that would help us understand what may have been passed to the command that is attempting to launch a container. It might have been missed in my previous message, but a good way of getting more detail about the Application Masters, containers, etc. is to collect logs via the "yarn logs" command. For instance:

yarn logs -applicationId application_1582677468069_0009 > application_1582677468069_0009.log

Resource Manager logs tell us some things, but not the whole picture. If you can run the above, the output may be pretty big, but take a look and see if you can find the string "NOT" in there; that might be a start (see the search sketch below).

If you are on Cloudera Manager 6.3 or higher, you can try the following to collect more information about the container launch:
(1) Via Cloudera Manager, set the following configuration to 600 (10 minutes): Localized Dir Deletion Delay. This tells the Node Manager to wait 10 minutes before cleaning up the container launcher, which will help us review the files used in the failed container launch.
(2) Set the following YARN configuration: Enable Container Launch Debug Information (check the box to enable it). This allows you to collect extra container launch information in the "yarn logs -applicationId" output.
(3) SAVE your changes and then restart the YARN service from CM.
(4) Run a test MapReduce job (pi, for instance).
(5) After it fails, run the following to collect the aggregated logs for the job: yarn logs -applicationId <app_id> NOTE: you can redirect the output to a file so you can search in it.
(6) Look for "launch_container" in the output to find the launch information.

Again, the output might be pretty big, so you can try adding it here or look for things that may be relevant.
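A quick search sketch over the collected file; the filename follows the yarn logs example above:

# Look for the stray token in the aggregated application logs
grep -n "NOT" application_1582677468069_0009.log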
02-26-2020
02:43 PM
Hello @HadoopBD,

Sorry to hear this has been causing you trouble. I'll see if I can help with the investigation. A few things you mention are relevant:
- All examples fail, which supports the assumption that something the jobs have in common is contributing to this issue.
- The issue appears to happen during container launch (since prelaunch.err contains the error).
- The failure indicates that the class name was somehow erroneously evaluated to the string "NOT", as seen here: Error: Could not find or load main class NOT

Since there is no class named "NOT", whatever evaluation was done to attempt to execute that class was incorrect. For instance, you can get the same result if you run the following:

> java NOT
Error: Could not find or load main class NOT

So the question becomes what happened before the attempt was made to launch a container. One thing that can be an influence is the environment (env variables). If this is an out-of-the-box installation, we would not expect this to happen, so if you have updated any YARN or HDFS configuration, it would be good to note it.

We could use a bit more information, so I would suggest getting the logs for the application like this:

# yarn logs -applicationId application_1582677468069_0009 > application_1582677468069_0009.log

This should allow you to look at all the logs for this application, including any information that may have been missing from the job stdout. Also consider trying to run the job from the same host as the Resource Manager to see if the failure is the same; if there is something related to your client environment or Hadoop configuration, that test may highlight it (see the sketch below).

Cheers,
Ben
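A sketch of that comparison run from the Resource Manager host, using the bundled pi example; the parcel path matches a standard CDH install, so adjust if yours differs:

# Run the same example job from the Resource Manager host to rule out client-side config
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 4 4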