Member since
08-22-2014
45
Posts
9
Kudos Received
5
Solutions
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 1506 | 08-20-2019 08:56 AM |
| | 24989 | 10-14-2016 10:25 AM |
| | 25014 | 10-11-2016 02:00 PM |
| | 2318 | 05-29-2015 10:43 AM |
| | 35492 | 05-12-2015 09:59 AM |
04-19-2017
02:38 PM
Question Can I change an environment once created in Cloudera Altus?
Answer By design, Cloudera Altus environments are not modifiable once created. To make a change, we recommend cloning the environment into a new one with the appropriate changes.
04-19-2017
12:47 PM
Question Is logging enabled by default for jobs run in Cloudera Altus Data Engineering?
Answer While we recommend enabling logging for troubleshooting Data Engineering jobs, it is not enabled by default. Details for enabling logging are available in the online doc Cloudera Altus Environment Setup: https://console.altus.cloudera.com/support/documentation.html?type=topics&page=ag_dataengr_get_start_env_setup NOTE: If logging is not enabled, cluster bundles that are automatically generated during shutdown of a Cloudera Altus-managed cluster will be lost.
02-17-2017
06:45 AM
Thanks for the update MSharma, glad to hear that you were able to resolve the issue!
02-07-2017
11:15 AM
Hi MSharma, Would you be able to provide additional context regarding the failure / permission issue that you're experiencing? If there's a specific error message or symptom, could you provide more details about what is happening?
12-01-2016
01:50 PM
Hi Nickk, If you are looking for what features are available for YARN resource accounting, there are two metrics available within the YARN API, as well as a more robust reporting capability within Cloudera Manager 5.7 onward.
The following are the definitions of memorySeconds and vcoreSeconds, which provide a very basic measurement of utilization in YARN[1]:
- memorySeconds: the aggregated amount of memory (in megabytes) the application has allocated, multiplied by the number of seconds the application has been running.
- vcoreSeconds: the aggregated number of vcores the application has allocated, multiplied by the number of seconds the application has been running.
The memorySeconds value can be used loosely as a generic measure of the resources a job consumed; for example, job 1 used X memorySeconds compared to job 2, which used Y memorySeconds. Attempting to extrapolate further insight from this measure is not recommended.
There are also additional reporting efforts being worked on, one of which is now available in CM: starting with CM 5.7, CM offers cluster utilization reporting, which can help provide per-tenant/per-user cluster usage reporting. Further details regarding Cluster Utilization reporting in CM are available here[2].
References:
[1] ApplicationResourceUsageReport.java (part of the YARN API) in the Apache Hadoop source code: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationResourceUsageReport.java
[2] Cloudera documentation regarding CM's Cluster Utilization Reporting functionality: http://www.cloudera.com/documentation/enterprise/5-7-x/topics/admin_cluster_util_report.html
Hope this helps!
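As a rough sketch of how these two metrics can be aggregated per user (the ResourceManager host/port and the helper name here are assumptions, not part of any official client), the values come from the RM REST API's /ws/v1/cluster/apps response:

```python
def usage_by_user(apps_json):
    """Sum memorySeconds / vcoreSeconds per user from a ResourceManager
    /ws/v1/cluster/apps response body (illustrative helper only)."""
    totals = {}
    for app in apps_json.get("apps", {}).get("app", []):
        mem, vcores = totals.get(app["user"], (0, 0))
        totals[app["user"]] = (mem + app["memorySeconds"],
                               vcores + app["vcoreSeconds"])
    return totals

# Example response body, trimmed to the relevant fields:
sample = {"apps": {"app": [
    {"user": "alice", "memorySeconds": 1024, "vcoreSeconds": 8},
    {"user": "bob",   "memorySeconds": 2048, "vcoreSeconds": 4},
    {"user": "alice", "memorySeconds": 512,  "vcoreSeconds": 2},
]}}

print(usage_by_user(sample))
# On a live cluster the body would come from something like
# requests.get("http://<rm-host>:8088/ws/v1/cluster/apps").json()
```

Keep in mind the caveat above: these sums are only a loose relative measure, not a precise accounting of utilization.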
10-14-2016
10:25 AM
Regarding the questions asked:
> What characters are allowed in aliases? e.g. is "!" allowed?
Avro aliases follow the name rules[1], essentially:
1) Must start with [A-Za-z_]
2) Subsequently contain only [A-Za-z0-9_]
> Is there anyway to get the alias information once it's loaded to Dataframe? There's a "printSchema"
> API that lets you print the schema names, but there's not a counterpart for printing the aliases. Is
> it possible to get the mapping from name to aliases from DF?
Per our CDH 5.7 documentation[2], the spark-avro library strips all doc, aliases, and other fields[3] when they are loaded into Spark.
To work around this, we recommend using the original field name rather than the alias when querying the table, since the Avro aliases are stripped during loading into Spark.
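As an illustrative sketch of the name rules above (the helper name is mine, not part of the Avro library), a name such as "!" fails validation:

```python
import re

# Avro name rule: must start with [A-Za-z_], then contain only [A-Za-z0-9_]
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_avro_name(name):
    """Return True if `name` satisfies the Avro name rules."""
    return bool(AVRO_NAME.fullmatch(name))

print(is_valid_avro_name("my_alias2"))  # True
print(is_valid_avro_name("bad!name"))   # False -- "!" is not allowed
print(is_valid_avro_name("2fast"))      # False -- cannot start with a digit
```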
References:
[1] http://avro.apache.org/docs/1.8.1/spec.html#names
[2] https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html
[3] Avro to Spark SQL Conversion:
The spark-avro library supports conversion for all Avro data types:
boolean -> BooleanType
int -> IntegerType
long -> LongType
float -> FloatType
double -> DoubleType
bytes -> BinaryType
string -> StringType
record -> StructType
enum -> StringType
array -> ArrayType
map -> MapType
fixed -> BinaryType
The spark-avro library supports the following union types:
union(int, long) -> LongType
union(float, double) -> DoubleType
union(any, null) -> any
The spark-avro library does not support complex union types.
All doc, aliases, and other fields are stripped when they are loaded into Spark.
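To illustrate the stripping behavior described above, here is a small sketch (not spark-avro itself) that removes the doc and aliases attributes from an Avro schema represented as a plain dict:

```python
def strip_metadata(schema):
    """Recursively drop 'doc' and 'aliases' from an Avro schema dict,
    mirroring the stripping spark-avro performs on load (illustrative only)."""
    if isinstance(schema, dict):
        return {k: strip_metadata(v) for k, v in schema.items()
                if k not in ("doc", "aliases")}
    if isinstance(schema, list):
        return [strip_metadata(s) for s in schema]
    return schema

record = {
    "type": "record", "name": "User", "doc": "A user record",
    "fields": [
        {"name": "id", "type": "long", "aliases": ["user_id"]},
        {"name": "email", "type": "string"},
    ],
}
print(strip_metadata(record))
```

After stripping, only the original field names remain, which is why querying by the original name (not the alias) is the reliable approach.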
05-29-2015
10:43 AM
Hi Jiro, Thanks for bringing this to our attention. It appears there's a bug in the validation for that property: the lower bound should be 0, not 1. We'll get this fixed in a future release. In the meantime, you can set mapreduce.shuffle.max.connections to 0 via a safety valve in Cloudera Manager. Please follow the directions below:
1. Log in to CM with admin privileges if you haven't done so already.
2. Navigate to the YARN configuration page (CM -> Clusters -> YARN -> Configuration).
3. Navigate to Gateway Default Group -> Advanced -> MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml.
4. In the field marked Value, enter the following:
<property>
  <name>mapreduce.shuffle.max.connections</name>
  <value>0</value>
</property>
5. Save changes, then deploy the client configuration.
Kind Regards, Anthony
05-12-2015
09:59 AM
6 Kudos
Hi TS, Responses inline below:
> Having said that the JobHistory Server is specific to Map Reduce jobs run on YARN, where other type of jobs will be shown?
That will depend on what kind of application is submitted to the YARN framework. If an MR2 job is submitted, the job details are available while the job is running within the Resource Manager Web UI (as this is part of the YARN framework); when the job completes, the job details are available via the Job History Server. If a Spark-on-YARN job is submitted, the job details will still be available while the job is running within the Resource Manager Web UI; however, when the job completes, the job details will be available on the Spark History Server, which is a separate role/service that is configured when Spark-on-YARN is set up as a service in Cloudera Manager (or when configuring it in CDH, per our installation guide).
> Besides MR and Spark jobs, what other types of jobs can we launch via YARN?
MR and Spark jobs are what is currently supported; however, this may change in the future as the need arises. YARN is application agnostic and is intentionally designed to allow developers to create applications that run on its distributed framework. Additional details regarding YARN applications are available in the Apache Hadoop documentation.
> Are jobs moved from /tmp/logs/<user-id>/logs folder to /user/history/done & /user/history/done_intermediate ones?
> Are they created simultaneously?
To best clarify the answer, below is a brief overview of the order of operations for an MR job in YARN:
1) The MR job is submitted to the RM from the client.
2) An application folder is created in /tmp/logs/<user-id>/logs/application_xxxxxxxxxxxx_xxxx.
3) The MR job runs in YARN on the cluster.
4) The MR job completes; counters from the job are reported on the job client that submitted it.
5) Counter information (the .jhist file) and job_conf.xml are written to /user/history/done_intermediate/<user>/job_xxxxxxxxxx_xxxx*.
6) The .jhist file and job_conf.xml are then moved from /user/history/done_intermediate/<user>/ to /user/history/done.
7) Container logs from each Node Manager are aggregated into /tmp/logs/<user-id>/logs/application_xxxxxxxxxxxx_xxxx.
Hope this helps!
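The steps above can be sketched as a small path-building helper (the function and the example ids are illustrative, not a Hadoop API; the real done directory also uses date-based subdirectories):

```python
def history_paths(user, app_id, job_id):
    """Build the HDFS locations an MR job's history artifacts pass through,
    following the order of operations above (illustrative helper only)."""
    return {
        # container logs aggregated from each Node Manager (steps 2 and 7)
        "aggregated_logs": f"/tmp/logs/{user}/logs/{app_id}",
        # .jhist and job_conf.xml written here first (step 5)
        "done_intermediate": f"/user/history/done_intermediate/{user}/{job_id}.jhist",
        # then moved here by the Job History Server (step 6)
        "done": "/user/history/done",
    }

paths = history_paths("alice", "application_1431000000000_0042",
                      "job_1431000000000_0042")
print(paths["aggregated_logs"])
# /tmp/logs/alice/logs/application_1431000000000_0042
```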
05-07-2015
02:10 PM
Hi TS, Thanks for your post. Regarding what you have reported, is the issue that you're seeing specific only to Spark jobs submitted to YARN? If so, it's important to note that the Job History Server is specific to Map Reduce jobs run on YARN, not Spark. The history of Spark jobs submitted to YARN is handled by a completely separate service called the Spark History Server. Are you able to run a simple Pi MapReduce job submitted to YARN, and does it appear in the JHS Web UI once completed?