Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Views | Posted |
|---|---|
| 1543 | 07-09-2019 12:53 AM |
| 9293 | 06-23-2019 08:37 PM |
| 8052 | 06-18-2019 11:28 PM |
| 8677 | 05-23-2019 08:46 PM |
| 3473 | 05-20-2019 01:14 AM |
02-13-2019
08:12 PM
The feature in CDH 6.x is implicit, and is aimed at supporting easier rolling upgrades: when the jars a job depends on ship exclusively with the job itself, changes to locally installed binaries will not affect it during an upgrade. A release note item documenting this is here: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cm_600_new_features.html#concept_qpj_jrq_v2b
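For reference, the upstream mechanism this builds on (shipping the framework jars as an archive alongside the job) looks roughly like the following when configured by hand; the HDFS path and archive alias below are assumptions for illustration, not what CDH 6 generates:

```xml
<!-- Sketch only: path and alias are hypothetical; CDH 6 manages this implicitly -->
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:///user/yarn/mapreduce/mr-framework/mr-framework.tar.gz#mr-framework</value>
</property>
<!-- Point the job classpath at the unpacked archive instead of local install paths -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$PWD/mr-framework/*,$PWD/mr-framework/lib/*</value>
</property>
```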
02-12-2019
07:44 AM
1 Kudo
> Does the latter overwrite the former for mapreduce applications?

No, at least as of CDH 5.x, the two are additive. The yarn.application.classpath value goes on early (adding Common, HDFS and YARN), followed by mapreduce.application.classpath (adding just MR2). They are kept separate because of another feature (available in CM 6.x) that lets you supply all framework jars as an archive along with the job, rather than relying on local, pre-installed locations on all worker hosts that are subject to change at any time outside of a container's runtime.

> There is also the variable MR2_CLASSPATH that is included by default in mapreduce.application.classpath. Where is it taken from?

This is exclusive to Cloudera Manager managed environments, and is a reserved env-var name used to assist parcels that may choose to supply some jars as 'plugins' to an app or a service. All such env-vars are listed here: https://github.com/cloudera/cm_ext/wiki/Plugin-parcel-environment-variables. In most cases you can ignore this env-var, as it will usually be empty.

> Is mapreduce.application.classpath relevant only for the gateways from where the application is submitted to YARN?

No, the values are just variable names, and are not substituted at the gateway. They are substituted only on the NodeManager when the prepared container command/script actually executes. This lets you manage different install paths on different worker hosts, where local environments point to the actual locations of the jars.
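For illustration, here is a minimal sketch of how the two properties typically look on a CDH-style host; the exact values are assumptions and vary by release and parcel layout, but they show the additive relationship and where MR2_CLASSPATH sits:

```xml
<!-- yarn-site.xml: applied first (Common, HDFS, YARN); values are illustrative -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>

<!-- mapred-site.xml: appended afterwards (MR2 only, plus any parcel plugin jars) -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value>
</property>
```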
12-03-2018
09:33 PM
This may be a very basic question, but I ask because it is unclear from the data you've posted: have you accounted for replication? 50 GiB of HDFS file lengths summed up (hdfs dfs -du values) with 3x replication would be ~150 GiB of actual used space on the physical storage. The /dfs/dn directories are where the file block replicas are stored.

Nothing unnecessary is retained in HDFS; however, a commonly overlooked item is older snapshots retaining data blocks that are no longer needed. Deleting such snapshots frees up the space occupied by blocks of files that were deleted after the snapshot was taken.

If you're unable to grow your cluster but need to store more data, then you may sacrifice availability of data by lowering your default replication to 2x or 1x (via the dfs.replication config for new data writes, and hdfs dfs -setrep n for existing data).
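As a sketch (the path and the target replication factor below are assumptions for illustration), you can compare logical file sizes against raw usage and lower replication like this:

```bash
# Logical file lengths under a directory (newer releases also print the
# raw space consumed, i.e. length x replication factor)
hdfs dfs -du -s -h /user/myapp        # hypothetical path

# Cluster-wide raw capacity and usage as the DataNodes report it
hdfs dfsadmin -report

# Reduce replication of existing data to 2x; -w waits until re-replication finishes
hdfs dfs -setrep -w 2 /user/myapp
```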
11-25-2018
06:34 PM
1 Kudo
Hi @ludof,

> Do I need to define the queues "oozie_launcher_queue" and "job_queue" somewhere in CDH, or can I just use them by providing the names? If yes, how should I define these queues? Are there recommended settings?

This depends on what you'd like to control. While simply specifying different existing (or auto-created) queues will solve your immediate problem, you can further control the total number of jobs Oozie can run in parallel by applying restrictions to the "oozie_launcher_queue" queue settings in FairScheduler.

> In the case of a Spark action, do I still need to specify the queue? If yes, with which property (since Spark does not use MapReduce)?

The Spark "action" part is still launched by Oozie via the standard 1-map MapReduce launcher job, so the oozie.launcher.* properties still apply here.

> Does it make sense to specify values for oozie.launcher.mapreduce.map.java.opts, oozie.launcher.mapreduce.reduce.java.opts and oozie.launcher.yarn.app.mapreduce.am.command-opts as I did in the example?

Yes, it is OK to specify those (and/or the *.memory.mb variants for direct resource request control) to reduce the footprint of the usually-light Oozie launcher jobs. A rough sketch of where these settings go is below.
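As a rough sketch (queue names and memory values are taken from this thread or assumed, not recommendations), the launcher-vs-job split and the launcher sizing go into the action's <configuration> block in workflow.xml:

```xml
<action name="my-action">
  <map-reduce> <!-- the oozie.launcher.* pattern is the same for spark/hive/shell actions -->
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- queue for the small launcher job itself -->
      <property>
        <name>oozie.launcher.mapreduce.job.queuename</name>
        <value>oozie_launcher_queue</value>
      </property>
      <!-- queue for the actual job the action runs -->
      <property>
        <name>mapreduce.job.queuename</name>
        <value>job_queue</value>
      </property>
      <!-- keep the launcher footprint small; values are illustrative -->
      <property>
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx768m</value>
      </property>
    </configuration>
    <!-- remaining map-reduce elements omitted -->
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
```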
11-23-2018
08:52 AM
2 Kudos
The broker port in that command is incorrect: you're supplying the ZooKeeper port (2181) in an argument that requires the broker client port (9092). Follow our guide at https://www.cloudera.com/documentation/kafka/latest/topics/kafka_command_line.html for using the command-line tools.
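A minimal sketch, assuming a hypothetical broker host 'broker1.example.com' and a topic named 'test':

```bash
# Point the tools at the broker client port (9092), not the ZooKeeper port (2181)
kafka-console-producer --broker-list broker1.example.com:9092 --topic test
kafka-console-consumer --bootstrap-server broker1.example.com:9092 --topic test --from-beginning
```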
11-19-2018
02:46 AM
1 Kudo
Oozie 4 and below makes this scenario special because its launcher task containers actually act as drivers for further application launches. Since these launchers are single map tasks, YARN does not consider them Application Masters, so the default defence of limiting the AM share of memory and CPU in a pool to avoid deadlock/starvation does not apply to them. This results in situations where enough Oozie launcher jobs take up all of the memory/CPU resources, leaving no room for the actual jobs' requests to be granted a container.

The best way to solve this is to send your Oozie launchers to a different pool than the applications they run. You can achieve this by setting oozie.launcher.mapreduce.job.queuename in the workflow actions to a pool different from the one the apps are intended to run in; see the sketch below.

Another way to solve it is to use Uber mode for Oozie launchers, but this comes with a few caveats around configuration, use of native libraries, classpath isolation, etc., and is no longer recommended.

In Oozie 5+ the dedicated Oozie launcher AM helps avoid this, as it is counted toward the AM share of its pool: https://issues.apache.org/jira/browse/OOZIE-1770
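As a sketch of the first approach (queue names, sizes and weights below are hypothetical), a FairScheduler allocation file that gives the launchers their own bounded pool could look like:

```xml
<allocations>
  <!-- dedicated, capped pool targeted by oozie.launcher.mapreduce.job.queuename -->
  <queue name="launchers">
    <maxResources>8192 mb, 8 vcores</maxResources>
    <weight>1.0</weight>
  </queue>
  <!-- pool for the actual applications; maxAMShare only protects containers
       that YARN recognizes as Application Masters -->
  <queue name="jobs">
    <weight>4.0</weight>
    <maxAMShare>0.5</maxAMShare>
  </queue>
</allocations>
```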
11-18-2018
08:49 PM
@maziyar, Are you sure you have "Always Use Linux Container Executor" checked and "Limit Nonsecure Container Executor Users" unchecked, and no safety valves overriding relevant properties? What CDH and CM version are you running?
10-31-2018
10:47 PM
1 Kudo
This feature arrived with YARN-7157, and is available in CDH 6.0 and higher versions. You'll need to specify "yarn.resourcemanager.display.per-user-apps" as "true" in the YARN Resource Manager yarn-site.xml safety valve to activate it on CDH6.
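A minimal sketch of the safety-valve entry (the property name is as above; the snippet goes into the ResourceManager's yarn-site.xml safety valve):

```xml
<property>
  <name>yarn.resourcemanager.display.per-user-apps</name>
  <value>true</value>
</property>
```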
10-24-2018
08:10 PM
The /tmp/file.txt is an HDFS path, not a local path. It could be any HDFS path if your Spark is configured to use HDFS as the default FS - I used /tmp just for illustration. The same should work in local-FS modes too, but I've not tried it.
10-22-2018
06:23 PM
1 Kudo
@zbz, The group ownership of all directories under /tmp/logs must be 'hadoop', or any group that both the 'yarn' and 'mapred' users belong to. In your case you have it set to 'supergroup', which does not have 'mapred' as a member, and is also entirely the wrong group to use - you do not want to grant HDFS superuser access to the YARN service. I'd recommend removing 'yarn' from the 'supergroup' group. This is what a normal installation should look like:

    # id -Gn mapred
    mapred hadoop
    # id -Gn yarn
    yarn hadoop
    # hadoop fs -ls -d /tmp/logs
    drwxrwxrwt   - mapred hadoop          0 2017-08-30 22:36 /tmp/logs

So if the 'hadoop' group is shared by your two IDs (mapred and yarn), then you may execute the following (as an HDFS superuser) to resolve the issue permanently:

    hadoop fs -chgrp -R hadoop /tmp/logs