I set up a Hadoop environment using Cloudera Manager (CM) to develop a POC. When I run an example I get the following error.
The following message is shown in the diagnostics of the job:
at YarnController.java line 463
in com.cloudera.server.web.cmf.YarnController collectYarnApplicationDiagnostics()
I am also unable to run spark-shell, as it hangs showing the following in an infinite loop:
15/06/10 10:56:15 INFO Client: Application report for application_1433782246697_0021 (state: ACCEPTED)
I used the following command to run spark-shell:
sudo -u hdfs spark-shell --executor-memory 256m --driver-memory 256m --master yarn
When I checked the queue, the spark-shell job also exists in the job list and gives the same error as above.
It looks like some misconfiguration in YARN from what I read, and I did not change any default settings while installing via CM.
Sorry I did not see this one earlier: which version of CM are you on? We should never throw an NPE that you can see ;-)
For the spark issue you most likely need to look at the scheduler and make sure you have enough resources to run containers.
There are simple tests you can run to check your config:
- a simple yarn pi job
- a simple spark pi job
If they pass then we have something to work with.
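For reference, the two smoke tests above can be run roughly like this. The jar paths assume a CDH parcel install under /opt/cloudera/parcels; adjust them if your layout differs:

```shell
# YARN test: the bundled MapReduce pi example
sudo -u hdfs hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

# Spark test: SparkPi submitted to YARN with small container sizes
sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --executor-memory 256m --driver-memory 256m \
  /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10
```

Both should finish and print an estimate of pi; if either hangs in ACCEPTED the scheduler cannot place the containers.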
I am using CDH 5.4.2. I also tried both tests.
Unfortunately this is what I see with the diagnostic data. I ran both examples and they do not work.
A server error has occurred. Send the following information to Cloudera.
Version: Cloudera Express 5.4.1 (#197 built by jenkins on 20150509-0041 git: 003e06d761f80834d39c3a42431a266f0aaee736)
The application working and finishing has nothing to do with CM being able to download the diagnostics data.
Check the RM web UI and see if the application has finished and if you can see the logs from the RM web UI or the JHS.
If that is working then we know we have a working YARN install and this question should be moved to the Cloudera Manager forum.
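If log aggregation is enabled, the logs for the stuck application can also be pulled from the command line (the application id below is the one from your earlier report; substitute your own):

```shell
# Fetch the aggregated container logs for a finished/killed application
sudo -u hdfs yarn logs -applicationId application_1433782246697_0021
```

If this prints the container logs, YARN itself is healthy and the diagnostics problem is on the CM side.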
However, looking at the error I think I know the cause of the NPE in Cloudera Manager:
this error occurs on a system that has never had a trial license and was configured directly with an Express license.
At least, that is how I have been able to reproduce it.
I will log an internal bug against CM for that.
BTW: you should not need to use that diagnostics collection unless we in support ask you to collect one to troubleshoot a failed application.
Thank you for your assistance. I am new to Cloudera Manager. Unfortunately I am still unable to get this working. I uninstalled everything and reinstalled with basic packages (Hadoop and ecosystem packages only). All services started, but HDFS has a critical error. It says the following:
383 under replicated blocks in the cluster. 385 total blocks in the cluster. Percentage under replicated blocks: 99.48%. Critical threshold: 40.00%
Then I installed again on another machine locally and I get the same problem.
To answer your question: no, the job is not ending. It starts and hangs. The following is what I see in the Resource Manager. Thank you for your help.
Application Type: MAPREDUCE
Started: Wed Jun 24 13:46:33 +0100 2015
Elapsed: 22mins, 46sec
Tracking URL: UNASSIGNED
Check the default replication factor: it is most likely set to 3 and you do not have 3 datanodes in the cluster. On small clusters you must amend it to line up with the cluster size.
hdfs dfs -setrep [-R] [-w] <numReplicas> <path>
to fix any files that are already on HDFS and are "under-replicated".
Also check "mapred.submit.replication" and lower it to align with the cluster size. If you do not, the submit will hang/fail.
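Putting the setrep advice together, a minimal fix-up on a single-node cluster might look like this (assumes you run as the hdfs superuser):

```shell
# Re-replicate everything already on HDFS down to a factor of 1;
# -R recurses, -w waits until replication actually completes
sudo -u hdfs hdfs dfs -setrep -R -w 1 /

# Confirm nothing is still under-replicated
sudo -u hdfs hdfs fsck / | grep -i "Under-replicated"
```

Note that -setrep only fixes existing files; new files still use the configured default replication, so also lower dfs.replication to 1 in the HDFS configuration.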
Thank you very much for the reply. I managed to get this under replicated block fixed as per your advice.
May I ask where I can find "mapred.submit.replication"? As you rightly guessed, yes, I am setting up a single-node cluster for a proof of concept.
Thank you again.
Check under the YARN service and you should be able to find it by just searching for it in the Configuration.
It should show under the Gateway.
I found it, but it is already 1. The parameter name was "mapreduce.client.submit.file.replication":
Mapreduce Submit Replication
Then I went further through the config and changed the following to 1:
Default Number of Parallel Transfers During Shuffle
Still no joy and the job hangs.
Check the RM UI and the resources that you have available. Make sure that you have enough vcores and memory for at least two containers: the AM and one executor container. If you do not have enough resources, things will hang. Check this blog on known problems and solutions for a case like yours.
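One quick way to see what each NodeManager is actually offering the scheduler, from the command line instead of the RM UI (run on a cluster host; the property names in the comments are the standard YARN ones, set in CM under the YARN service Configuration):

```shell
# List all NodeManagers known to the ResourceManager and their state
sudo -u hdfs yarn node -list -all

# Show memory/vcore capacity and usage for one node
# (substitute a Node-Id from the list above)
sudo -u hdfs yarn node -status <node-id>

# For a tiny cluster, two containers (AM + one executor) must fit, i.e. roughly:
#   yarn.nodemanager.resource.memory-mb >= 2 * yarn.scheduler.minimum-allocation-mb
# Lowering yarn.scheduler.minimum-allocation-mb (e.g. to 256) often unblocks small jobs.
```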