Member since: 01-16-2014
Posts: 336
Kudos Received: 43
Solutions: 31

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3403 | 12-20-2017 08:26 PM
 | 3379 | 03-09-2017 03:47 PM
 | 2844 | 11-18-2016 09:00 AM
 | 5030 | 05-18-2016 08:29 PM
 | 3862 | 02-29-2016 01:14 AM
07-20-2015
09:05 PM
YARN-2865 is fixed in CDH 5.3.3 and later, and in CDH 5.4.0 and later. You are most likely seeing something that looks like YARN-2865 but is slightly different, unless the fix for YARN-2865 is itself incorrect, which could also happen. Can you please share the logs that show the exception? Wilfred
07-19-2015
05:54 PM
In Spark a transformation works directly on the RDD. Transformations are evaluated lazily and are closely coupled to the RDDs; you cannot use them separately. What you are looking for is a tool that can generate Spark code for you based on the transformation rule. I don't think that something like that exists. Wilfred
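To illustrate the coupling and the lazy evaluation, here is a minimal Scala sketch (the object name and the data are made up for the example): transformations such as map and filter only record lineage on the RDD, and nothing executes until an action such as count is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyTransformDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 10)

    // map and filter are transformations: they only record lineage
    // on the RDD, nothing has run yet at this point.
    val transformed = rdd.map(_ * 2).filter(_ > 5)

    // count is an action: only now does Spark execute the pipeline.
    println(transformed.count())

    sc.stop()
  }
}
```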
07-15-2015
10:36 PM
Do these transformations not work for you? Anything that you write in Spark can be adjusted to work with different storage underneath. What else would you be looking for? Wilfred
07-15-2015
10:18 PM
A container log is not part of the YARN service logs and will not be affected by any of the YARN settings. The container log looks like a log from an AM, which means you are most likely looking at a problem of the AM web UI not being able to bind. The AM web UI binds to an ephemeral port, which cannot be limited to a set of ports. Make sure that your security groups in AWS allow access to any port on the NMs. Wilfred
07-08-2015
10:28 PM
Check this part of the documentation for YARN tuning; it explains it all. You might have a default value set which you have overlooked, causing the issue. Wilfred
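If it helps, the defaults that are most often overlooked are the container sizing properties in yarn-site.xml. A sketch is below; the values shown are placeholders to illustrate, not recommendations for your cluster:

```xml
<!-- yarn-site.xml: container sizing properties whose defaults are
     easy to overlook. The values here are placeholders only. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```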
07-08-2015
12:43 AM
1 Kudo
You always need to provide your own dependencies for your application. Spark has no dependency on HBase, and the fact that some of the HBase jars are pulled in, because they are part of a Hive dependency that Spark has, is a coincidence. If you build an application you should always make sure that you resolve your own dependencies. It might have worked out of the box in previous versions, or in a distribution from a different provider, because that Spark version had different dependencies. BTW: you should be using the spark.[driver|executor].extraClassPath settings, as that is the current way to do this. Wilfred
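As a sketch, the settings can go into spark-defaults.conf like this. The HBase lib path is just an example location; point it at wherever the jars actually live, and remember the path must exist on every node for the executor setting:

```
# spark-defaults.conf -- paths below are examples only
spark.driver.extraClassPath    /opt/cloudera/parcels/CDH/lib/hbase/lib/*
spark.executor.extraClassPath  /opt/cloudera/parcels/CDH/lib/hbase/lib/*
```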
07-07-2015
12:27 AM
The fact that the job runs as the hive user is correct. You have impersonation turned off when you turned on Sentry, or at least that is what you should have done. The hive user is thus the user that executes the job. However, the end user should be used to determine which queue the application is submitted to (if you use the FairScheduler). This does require some configuration on your side to make this work. There is a Knowledge Base article in our support portal on how to set that up for CM and non-CM clusters; search for "Hive FairScheduler". I remember providing the steps using CM on the forum before:

1. Log in to Cloudera Manager.
2. Navigate to Cluster > Yarn > Instances > ResourceManager > Processes.
3. Click on the link fair-scheduler.xml; this will open a new tab or window.
4. Copy the contents into a new file called fair-scheduler.xml.
5. On the HiveServer2 host, create a new directory to store the xml file (for example, /etc/hive/fsxml). Note: this file should not be placed in the standard Hive configuration directory, since that directory is managed by Cloudera Manager and the file could be removed when changing other configuration settings.
6. Upload the fair-scheduler.xml file to the directory created above.
7. In Cloudera Manager, navigate to Cluster > Hive > Service-Wide > Advanced > Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml and add the following property:
   <property>
     <name>yarn.scheduler.fair.allocation.file</name>
     <value>/etc/hive/fsxml/fair-scheduler.xml</value>
   </property>
8. Save changes.
9. Restart the Hive service.

NOTE: you must have the following rule as the first rule in the placement policy (see the sketch after these steps): <rule name="specified" /> Wilfred
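For reference, a minimal sketch of what such a fair-scheduler.xml can look like with the placement policy in place (the queue names are illustrative):

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="root">
    <queue name="default"/>
  </queue>
  <queuePlacementPolicy>
    <!-- "specified" must be the first rule, so the queue resolved for
         the end user is honoured instead of every job landing in the
         same queue. -->
    <rule name="specified"/>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
```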
07-06-2015
09:21 PM
Please make sure that you have also added the setting to the configuration on the client node. The setting should be applied to all nodes in the cluster, not just the nodes that run the service. Wilfred
06-29-2015
12:43 AM
We do not support upgrading Spark without upgrading the rest of CDH. Spark is compiled against a specific version of Hadoop, and the version of Hadoop can change between releases of CDH. You also need to take into account the dependencies of Spark (like Hive), which might change between versions. Even if you were able to upgrade the package, you might get weird failures due to the dependency breakage. Wilfred
06-29-2015
12:33 AM
No, there is nothing that you can run to check whether log aggregation has finished. It is a distributed state known only inside the NMs. The only thing you can do is retry the log retrieval (for example, rerun yarn logs -applicationId <application id>). Log aggregation is performed by the NodeManager(s) when the containers finish. There is no way to tell how long that will take, since one node could be running more than one container that finishes at almost the same time. The load on HDFS is also a factor: copying to HDFS will only be as fast as HDFS can handle at that point. Wilfred