Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Views | Posted |
|---|---|
| 1543 | 07-09-2019 12:53 AM |
| 9293 | 06-23-2019 08:37 PM |
| 8052 | 06-18-2019 11:28 PM |
| 8677 | 05-23-2019 08:46 PM |
| 3473 | 05-20-2019 01:14 AM |
02-13-2019
08:12 PM
The feature in CDH 6.x is implicit, and is aimed at supporting easier rolling upgrades: when the jars a job depends on ship exclusively with the job itself, changes to locally installed binaries will not affect it during an upgrade. A release note item documenting this is here: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cm_600_new_features.html#concept_qpj_jrq_v2b
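For reference, the upstream mechanism this builds on (shipping the framework jars as an archive alongside the job) looks roughly like the following when configured by hand; the HDFS path and archive alias below are assumptions for illustration, not what CDH 6 generates:

```xml
<!-- Sketch only: path and alias are hypothetical; CDH 6 manages this implicitly -->
<property>
  <name>mapreduce.application.framework.path</name>
  <value>hdfs:///user/yarn/mapreduce/mr-framework/mr-framework.tar.gz#mr-framework</value>
</property>
<!-- Point the job classpath at the unpacked archive instead of local install paths -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$PWD/mr-framework/*,$PWD/mr-framework/lib/*</value>
</property>
```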
02-12-2019
07:44 AM
1 Kudo
> Does the latter overwrite the former for mapreduce applications?

No, at least as of CDH 5.x, the two are additive. The yarn.application.classpath value goes on early (adding Common, HDFS and YARN), followed by mapreduce.application.classpath (adding just MR2). They are kept separate because of another feature (available in CM 6.x) that lets you supply all framework jars as an archive along with the job, rather than relying on local, pre-installed locations on all worker hosts that are subject to change at any time outside of a container's runtime.

> There is also the variable MR2_CLASSPATH that is included by default in mapreduce.application.classpath. Where is it taken from?

This is exclusive to Cloudera Manager managed environments, and is a reserved env-var name used to assist parcels that may choose to supply some jars as 'plugins' to an app or a service. All such env-vars are listed here: https://github.com/cloudera/cm_ext/wiki/Plugin-parcel-environment-variables. In most cases you can ignore this env-var, as it will usually be empty.

> Is mapreduce.application.classpath relevant only for the gateways from where the application is submitted to YARN?

No, the values are just variable names, and are not substituted at the gateway. They are substituted only on the NodeManager when the prepared container command/script actually executes. This lets you manage different install paths on different worker hosts, where local environments point to the actual locations of the jars.
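For illustration, here is a minimal sketch of how the two properties typically look on a CDH-style host; the exact values are assumptions and vary by release and parcel layout, but they show the additive relationship and where MR2_CLASSPATH sits:

```xml
<!-- yarn-site.xml: applied first (Common, HDFS, YARN); values are illustrative -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>

<!-- mapred-site.xml: appended afterwards (MR2 only, plus any parcel plugin jars) -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value>
</property>
```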
12-03-2018
09:33 PM
This may be a very basic question, but I ask because it is unclear from the data you've posted: have you accounted for replication? 50 GiB of HDFS file lengths summed up (hdfs dfs -du values) with 3x replication would be ~150 GiB of actual used space on the physical storage. The /dfs/dn directories are where the file block replicas are stored.

Nothing unnecessary is retained in HDFS; however, a commonly overlooked item is older snapshots retaining data blocks that are no longer needed. Deleting such snapshots frees up the space occupied by blocks of files that were deleted after the snapshot was taken.

If you're unable to grow your cluster but need to store more data, then you may sacrifice availability of data by lowering your default replication to 2x or 1x (via the dfs.replication config for new data writes, and hdfs dfs -setrep n for existing data).
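As a sketch (the path and the target replication factor below are assumptions for illustration), you can compare logical file sizes against raw usage and lower replication like this:

```bash
# Logical file lengths under a directory (newer releases also print the
# raw space consumed, i.e. length x replication factor)
hdfs dfs -du -s -h /user/myapp        # hypothetical path

# Cluster-wide raw capacity and usage as the DataNodes report it
hdfs dfsadmin -report

# Reduce replication of existing data to 2x; -w waits until re-replication finishes
hdfs dfs -setrep -w 2 /user/myapp
```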
11-25-2018
06:34 PM
1 Kudo
Hi @ludof,

> Do I need to define the queues "oozie_launcher_queue" and "job_queue" somewhere in CDH, or can I just use them by providing the names? If yes, how should I define these queues? Are there recommended settings?

This depends on what you'd like to control. While simply specifying different existing (or auto-created) queues will solve your immediate problem, you can further control the total number of jobs Oozie can run in parallel by applying restrictions to the "oozie_launcher_queue" queue settings in FairScheduler.

> In the case of a Spark action, do I still need to specify the queue? If yes, with which property (since Spark does not use MapReduce)?

The Spark "action" part is still launched by Oozie via the standard 1-map MapReduce launcher job, so the oozie.launcher.* properties still apply here.

> Does it make sense to specify values for oozie.launcher.mapreduce.map.java.opts, oozie.launcher.mapreduce.reduce.java.opts and oozie.launcher.yarn.app.mapreduce.am.command-opts as I did in the example?

Yes, it is OK to specify those (and/or the *.memory.mb variants for direct resource request control) to reduce the footprint of the usually-light Oozie launcher jobs. A rough sketch of where these settings go is below.
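As a rough sketch (queue names and memory values are taken from this thread or assumed, not recommendations), the launcher-vs-job split and the launcher sizing go into the action's <configuration> block in workflow.xml:

```xml
<action name="my-action">
  <map-reduce> <!-- the oozie.launcher.* pattern is the same for spark/hive/shell actions -->
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- queue for the small launcher job itself -->
      <property>
        <name>oozie.launcher.mapreduce.job.queuename</name>
        <value>oozie_launcher_queue</value>
      </property>
      <!-- queue for the actual job the action runs -->
      <property>
        <name>mapreduce.job.queuename</name>
        <value>job_queue</value>
      </property>
      <!-- keep the launcher footprint small; values are illustrative -->
      <property>
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx768m</value>
      </property>
    </configuration>
    <!-- remaining map-reduce elements omitted -->
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
```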
11-23-2018
08:52 AM
2 Kudos
The broker port in that command is incorrect: you're supplying the ZooKeeper port (2181) in an argument that requires the broker client port (9092). Follow our guide at https://www.cloudera.com/documentation/kafka/latest/topics/kafka_command_line.html for using the command-line tools.
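A minimal sketch, assuming a hypothetical broker host 'broker1.example.com' and a topic named 'test':

```bash
# Point the tools at the broker client port (9092), not the ZooKeeper port (2181)
kafka-console-producer --broker-list broker1.example.com:9092 --topic test
kafka-console-consumer --bootstrap-server broker1.example.com:9092 --topic test --from-beginning
```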
11-19-2018
02:46 AM
1 Kudo
Oozie 4 and below makes this scenario special because its launcher task containers actually act as drivers for further application launches. Since these launchers are single map tasks, YARN does not consider them Application Masters, so the default defence of limiting the AM share of memory and CPU in a pool to avoid deadlock/starvation does not apply to them. This results in situations where enough Oozie launcher jobs take up all of the memory/CPU resources, leaving no room for the actual jobs' requests to be granted a container.

The best way to solve this is to send your Oozie launchers to a different pool than the applications they run. You can achieve this by setting oozie.launcher.mapreduce.job.queuename in the workflow actions to a pool different from the one the apps are intended to run in; see the sketch below.

Another way to solve it is to use Uber mode for Oozie launchers, but this comes with a few caveats around configuration, use of native libraries, classpath isolation, etc., and is no longer recommended.

In Oozie 5+ the dedicated Oozie launcher AM helps avoid this, as it is counted toward the AM share of its pool: https://issues.apache.org/jira/browse/OOZIE-1770
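As a sketch of the first approach (queue names, sizes and weights below are hypothetical), a FairScheduler allocation file that gives the launchers their own bounded pool could look like:

```xml
<allocations>
  <!-- dedicated, capped pool targeted by oozie.launcher.mapreduce.job.queuename -->
  <queue name="launchers">
    <maxResources>8192 mb, 8 vcores</maxResources>
    <weight>1.0</weight>
  </queue>
  <!-- pool for the actual applications; maxAMShare only protects containers
       that YARN recognizes as Application Masters -->
  <queue name="jobs">
    <weight>4.0</weight>
    <maxAMShare>0.5</maxAMShare>
  </queue>
</allocations>
```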
11-18-2018
08:49 PM
@maziyar, Are you sure you have "Always Use Linux Container Executor" checked and "Limit Nonsecure Container Executor Users" unchecked, and no safety valves overriding relevant properties? What CDH and CM version are you running?
10-31-2018
10:47 PM
1 Kudo
This feature arrived with YARN-7157, and is available in CDH 6.0 and higher versions. You'll need to specify "yarn.resourcemanager.display.per-user-apps" as "true" in the YARN Resource Manager yarn-site.xml safety valve to activate it on CDH6.
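A minimal sketch of the safety-valve entry (the property name is as above; the snippet goes into the ResourceManager's yarn-site.xml safety valve):

```xml
<property>
  <name>yarn.resourcemanager.display.per-user-apps</name>
  <value>true</value>
</property>
```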
10-24-2018
08:10 PM
The /tmp/file.txt is an HDFS path, not a local path. It could be any HDFS path if your Spark is configured to use HDFS as the default FS - I used /tmp just for illustration. The same should work in local-FS modes too, but I've not tried it.
10-22-2018
06:23 PM
1 Kudo
@zbz, The group ownership of all directories under /tmp/logs must be 'hadoop', or any group that both the 'yarn' and 'mapred' users belong to. In your case you have it set to 'supergroup', which does not have 'mapred' as a member, and is also entirely the wrong group to use - you do not want to grant HDFS superuser access to the YARN service. I'd recommend removing 'yarn' from the 'supergroup' group. This is what a normal installation should look like:

    # id -Gn mapred
    mapred hadoop
    # id -Gn yarn
    yarn hadoop
    # hadoop fs -ls -d /tmp/logs
    drwxrwxrwt   - mapred hadoop          0 2017-08-30 22:36 /tmp/logs

So if the 'hadoop' group is shared by your two IDs (mapred and yarn), then you may execute the following (as an HDFS superuser) to resolve the issue permanently:

    hadoop fs -chgrp -R hadoop /tmp/logs