Member since: 01-16-2014
Posts: 336
Kudos Received: 43
Solutions: 31
My Accepted Solutions
Views | Posted |
---|---|
1818 | 12-20-2017 08:26 PM |
1831 | 03-09-2017 03:47 PM |
1646 | 11-18-2016 09:00 AM |
2361 | 05-18-2016 08:29 PM |
2078 | 02-29-2016 01:14 AM |
04-06-2016
10:45 AM
There are two settings that you need to look at:
- yarn.scheduler.maximum-allocation-mb sets the maximum size of a container.
- yarn.nodemanager.resource.memory-mb sets the maximum amount of memory available on the node.

When a request comes in for a container that is larger than the maximum-allocation-mb it will be denied and the application cannot be submitted. If you have 240 GB in the host I would expect the NodeManager to get about 200 GB of that if you run only YARN on the node. That should allow you to run more than one large container of the size you have. However, running a 40 GB container for MR seems a bit over the top: do you really need all of that? If you use DRF then you might not have a memory limitation but a vcores limitation. You have not mentioned anything about that side, so I am not sure what you have configured and whether that might be the problem. There are also things like the number of applications that can run and the AM share which could influence what you see. There is a series of blog posts out that should also help with this; it starts with: http://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/ Three parts are out currently and part 4 is coming real soon... If you need more help, open a support case with us and we can work through setting up the scheduler with you. Wilfred
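As a quick sanity check you can grep the deployed client configuration on a NodeManager host; a minimal sketch, assuming the usual /etc/hadoop/conf location on a CM-managed CDH node:

```
# Show the two limits as they are currently deployed on this node.
grep -A1 -E 'yarn\.scheduler\.maximum-allocation-mb|yarn\.nodemanager\.resource\.memory-mb' \
  /etc/hadoop/conf/yarn-site.xml
```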
03-10-2016
12:58 AM
Spark SQL is supported in CDH 5.5, with some limitations. One of the things that we do not support is the Hive thriftserver, see: CDH 5.5 docs. The thriftserver's dependency on a specific Hive version, versus what we ship in CDH, is still a problem. Wilfred
03-06-2016
04:12 PM
You most likely have pulled in too many dependencies when you built your application. The Gradle documentation on building shows that it behaves differently from Maven: when you package up an application, Gradle includes far more dependencies than Maven does. This could have pulled in dependencies which you don't want or need. Make sure that the application only contains what you really need and what is not already provided by Hadoop. Search for Gradle and dependency management; you need some way to define a "provided" scope in Gradle. Wilfred
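Before sorting out a "provided"-style scope, it can help to see what Gradle would actually bundle; a sketch, assuming a recent Gradle wrapper (older versions call the configuration "runtime" instead of "runtimeClasspath"):

```
# Print everything that would end up on the application's runtime classpath.
./gradlew dependencies --configuration runtimeClasspath
```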
03-01-2016
01:03 AM
You cannot just replace a file in HDFS and expect it to be picked up. The files will be localised during the run and there is a check to make sure that the files are where they should be. See the blog on how the sharelib works. The OOTB version of Spark that we deliver with CDH does not throw the error that you show. It runs with the provided http client, so I doubt that replacing the jar is the proper solution. It is most likely due to a mismatch in one of the other jars that results in this error. Wilfred
02-29-2016
05:12 PM
You will need to change the build to pull in the right version as documented on the Spark pages. The Maven repository information for CDH is documented in our generic docs. You would probably end up with something like -Dhadoop.version=2.6.0-cdh5.4.0. Wilfred
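Purely as an illustration (the goals and version string depend on your CDH release and project), the override ends up on the Maven command line like this:

```
# Build against the CDH-packaged Hadoop artifacts instead of the generic ones.
mvn clean package -DskipTests -Dhadoop.version=2.6.0-cdh5.4.0
```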
02-29-2016
01:14 AM
1 Kudo
I would use the Spark action as much OOTB as possible and leverage the sharelib, since it handles a number of things for you. You can use multiple versions of the sharelib as described here; check the section on overriding the sharelib. Wilfred
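A sketch of how you could inspect and override the sharelib; the Oozie URL is a placeholder, and the per-action override property follows the Oozie documentation:

```
# List the jars currently served by the Spark sharelib (URL is a placeholder).
oozie admin -oozie http://oozie-host:11000/oozie -shareliblist spark

# In the workflow's job.properties you can then point the Spark action at a
# specific sharelib, for example:
# oozie.action.sharelib.for.spark=spark
```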
02-25-2016
09:03 PM
You should not, and cannot, rely on the joda version that the AWS SDK brings in. If they use a shaded version then you cannot reach it and you would not see it. If they have an unshaded version then you need to shade your version. You need to declare it as your own dependency and then shade your version in your build. It is not the simplest thing to figure out, especially if you have never done it before, but after sorting it out once it should not cost you anything and it will probably make upgrades of Hadoop and maintenance of your application simpler. Wilfred
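After the shade (Maven) or shadow (Gradle) plugin has run, it is worth checking that joda really ended up relocated under your own package prefix; a sketch with a hypothetical jar name:

```
# The joda classes should now appear under the relocation prefix you configured.
jar tf myapp-shaded.jar | grep -i joda | head
```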
02-25-2016
08:56 PM
1 Kudo
If you need a specific version of guava you cannot just add it to the classpath. If you do, you rely entirely on the randomness that is in the class loaders; there is no guarantee that you will get the proper version of guava loaded. The first thing you need to do is make sure that you get the proper version of guava loaded at all times. The proper way to do this is to shade (Maven) or shadow (Gradle) your guava; check the web on how to do this. It is really the only way to make sure you get the correct version and not break the rest of Hadoop at the same time. After that is done you need to use the classpath addition as discussed earlier and make sure that you add your shaded version. This is the only way to do this without being vulnerable to changes in the Hadoop dependencies. Wilfred
02-25-2016
08:20 PM
If you need a version of a library that is already part of Hadoop I would strongly recommend that you include your version of the library in a shaded form in your application. The shading makes sure that you get your version and that it will not interfere with the existing versions. We are currently writing a knowledge base article on how to do this; for now you will need to check online for "maven shade" or "gradle shadow", depending on how you build your application. Wilfred
02-25-2016
07:46 PM
Hive on Spark is not officially supported and what you see is one of those cases. Certain queries are slower, take more memory, or fail. That is why it is not supported yet. We are working hard to fix and tune these use cases. Until that is done the only workaround is to fall back on the MR execution engine. Wilfred
02-25-2016
07:42 PM
Yes, it should use Spark when you do that. There is nothing else that you need to do to run Hive on Spark. Keep in mind that it is not officially supported for production. Spark normally runs on top of YARN, so you should see in the RM that a Spark application was run. You can also check the Spark JHS for the Spark data. Wilfred
02-25-2016
07:39 PM
We highly recommend that you use the Spark action and not the shell action for Spark. Also make sure that you configure the gateways for Spark on the system. If you need more help you will need to provide a little more detail about what you are doing. Wilfred
02-25-2016
07:26 PM
The thrift server in Spark is not tested with, and might not be compatible with, the Hive version that is in CDH. Hive in CDH is 1.1 (patched) and Spark uses Hive 1.2.1. You might see API issues during compilation or runtime failures due to that. Wilfred
02-25-2016
07:19 PM
The second version of Spark must be compiled against the CDH artifacts. You cannot pull down a generic build from a repository and expect it to work (we know it has issues). You would thus need to compile your own version of Spark against the correct CDH version. Using Spark from a later or earlier CDH release will not work, most likely due to changes in dependent libraries (i.e. the Hadoop or Hive version). For the shuffle service and the history service: both are backwards compatible and only one of each is needed (running two is difficult and not needed). However, you must run and configure only the one that comes with the latest version of Spark in your cluster. There is no formal support for this and client configs will need manual work... Wilfred
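For illustration only, compiling a newer Spark against the CDH artifacts mostly comes down to overriding the Hadoop version in its Maven build; the profiles and version string below are assumptions, so check the Spark build docs for your release:

```
# Example: build Spark for YARN against the CDH-packaged Hadoop.
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.2 -DskipTests clean package
```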
02-10-2016
06:30 PM
Node labels are not considered ready by Cloudera or even by the upstream community. The basis for node labels was added in Hadoop 2.6 with a large number of limitations. The only scheduler that currently implements node label support is the CapacityScheduler; none of the other schedulers support it yet. Cloudera recommends, for a number of reasons, that you use the FairScheduler in your cluster. Setting up node labels is partially supported through the command line interface but it still requires manual steps and configuration. Support is also limited to one (1) label per YARN application, and using labels requires you to add them on the command line when an application is submitted. MapReduce does not implement any of the node label support yet (MAPREDUCE-6304) in the current release. Due to their limited implementation, node labels can also cause a large increase in scheduling delays, which makes using them counterproductive. We are working with the community to make node labels ready for production but currently they are not there. Wilfred
01-24-2016
04:21 PM
The NM loads its configuration on startup and then reregisters with the RM. So the only thing that should be needed is to update the configuration in CM for the node(s) and restart the NM service on those nodes. I have done that numerous times and it always works. Wilfred
01-24-2016
03:09 PM
1 Kudo
A vcore is a virtual core. You can define it however you want. You could, as an example, define a vcore as the processing power delivered by a 1GHz thread core. A 3GHz core would then be comparable to 3 vcores in the node manager. Your container request then needs to use multiple vcores, which accounts for the difference in speed. Not a lot of clusters do this, due to the administrative overhead and the fact that if the end users do not use the vcores correctly it can overload the faster machines. Wilfred
01-20-2016
07:20 PM
The codec is responsible for the reads and you will need to talk to the creator of the codec to provide you with the information on why this is happening. Wilfred
01-20-2016
07:18 PM
We are aware of that upstream bug and found it during our internal performance testing, around the time we released CDH 5.5.1. It will be included in an upcoming CDH 5.5 release. Wilfred
01-20-2016
07:12 PM
1 Kudo
You need to set up the nodes with the proper vcores and memory available for the NM. That should solve the problem. It will put more load on the larger nodes than on the small nodes. The container is also scheduled on a node based on data locality, which is out of your control. You cannot, however, force processing of a specific split to start on a specific node. Wilfred
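You can verify what each NodeManager actually advertises to the RM from the command line; the node id below is a placeholder, take the real ids from the node list output:

```
# Show all NodeManagers known to the RM.
yarn node -list
# Print the memory and vcore capacity that a single node reports.
yarn node -status worker01.example.com:8041
```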
01-20-2016
07:07 PM
Use the yarn logs -applicationId APP_ID command to grab the executor logs so you can get some more detail on what is failing. APP_ID needs to be replaced with the application ID of your application, in your example application_1450777964379_0027. Wilfred
01-20-2016
06:53 PM
1 Kudo
CDH 5.3 does not come with Spark 1.5, so you are running an unsupported cluster; please be aware of that. Weight has nothing to do with preemption; that is a common misunderstanding. The weight just decides which queue gets a higher priority during the scheduling cycle. So if I have queues with the weights 3:1:1, then out of every 10 schedule attempts 6 will go to the queue with weight 3 and 2 attempts will go to each queue with weight 1, totalling 10 attempts. Minimum share preemption works only if you have the minimum and maximum shares for a queue set, so make sure you have that. The fair share of a queue is calculated based on the demand in the queue (i.e. the applications that are running in it). You thus might not be hitting the fair share preemption threshold... Wilfred
01-17-2016
09:17 PM
Those settings do not work on YARN, so no effect is expected; check the Spark standalone docs. For YARN the cleanup should be automatic and triggered by a shutdown and proper clean up of the context. Which version of CDH are you running and how have you configured the shuffle? Wilfred
01-17-2016
06:24 PM
The spark job server is not a Cloudera provided application. You will need to get support from the team that hosts the code at the main dev branch. That said, I can see one huge problem: you are trying to use job server version 0.4, which is for an older release of Spark (1.0.2) than the one in CDH 5.4 (1.3.x). Make sure that you use the proper version and fix your project compilation etc. Also, Spark in CDH uses a base version of Spark and adds fixes on top of that, so you might need a slightly different version of the job server than you think. It is all up to you to make sure it works for your use case and is stable. We are working on an equivalent job server as part of CDH. Wilfred
01-17-2016
02:32 PM
No, we do not support the thrift server, as per the documentation: CDH 5.5 Spark release note. Hive on Spark is also still in beta and we are finishing features, as per the Hive CDH 5.5 release note; it is thus experimental and things might not work. We cannot provide guidance on the roadmap for features that are not yet complete. Wilfred
09-28-2015
08:48 PM
With the --files option you put the file in the working directory on the executor. You are trying to point to the file using an absolute path, which is not what the --files option does for you. Can you use just the name "rule2.xml" and not a path? When you read the documentation for --files, see the important note at the bottom of the "Running on YARN" page. Also, do not use Resources.getResource() but just open it with a plain Java construct like new FileInputStream("rule2.xml") or something like it. Wilfred
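A minimal sketch of the submission side; the class, jar and master values are placeholders, rule2.xml is the file from your example:

```
# Ship rule2.xml into the container working directories so it can be opened by name.
spark-submit --class com.example.RuleJob \
  --master yarn-cluster \
  --files rule2.xml \
  rules-app.jar
```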
09-28-2015
08:31 PM
How to do this is documented here: Running on YARN. You need to pass in a custom log4j.properties file. With rolling logs you will most likely lose the YARN log tracking and aggregation, and I am not sure that this will work properly. The container will most likely keep pointing to the base file and never move to the rolled version, or you will only ever be able to track the current one. Wilfred
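A sketch of passing a custom log4j.properties on submission, assuming the file sits in your local submission directory (jar name and master are placeholders):

```
# Distribute the custom log4j.properties and point both driver and executors at it.
spark-submit --master yarn-cluster \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  app.jar
```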
09-28-2015
08:08 PM
To rule out a custom jar issue, can you run the pi example to check whether the cluster is set up correctly? We have documented how to run a Spark application, with the example, in our docs. The error that you show points to a classpath problem: the Spark classes cannot be found on your classpath. Wilfred
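For reference, the bundled SparkPi example can be run roughly like this on a parcel install; the jar path and the argument are illustrative, so use the examples jar that ships with your release:

```
# Run the SparkPi example on YARN as a quick cluster sanity check.
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10
```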
09-28-2015
07:59 PM
1 Kudo
Files should be moved from done_intermediate to the done directory during the normal running of the JHS. Two things to check:
- Does the JHS show any errors in its logs?
- Run the following command on the host that runs the JHS: id -Gn mapred. It should show "mapred hadoop" as output.

That assumes the JHS runs as the mapred user; if it runs as another user, replace mapred in the id command. Wilfred
09-28-2015
12:28 AM
Can you make sure the JHS is up and running and that it can find the job when you look for it in the UI? You can also try the command line to see if it is on HDFS: yarn logs -applicationId APP_ID -appOwner USER_ID. The APP_ID is the same as the job ID that you showed but with "job" replaced by "application"; the USER_ID is the ID of the user that ran the job. Wilfred