Member since: 01-16-2014
Posts: 336
Kudos Received: 43
Solutions: 31

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3393 | 12-20-2017 08:26 PM
 | 3371 | 03-09-2017 03:47 PM
 | 2841 | 11-18-2016 09:00 AM
 | 5007 | 05-18-2016 08:29 PM
 | 3850 | 02-29-2016 01:14 AM
06-09-2015
12:41 AM
Good to hear that this has been fixed! We have seen this issue in early CDH 5 releases, but it was fixed in CM/CDH 5.2 and later. Cloudera Manager should have deployed that configuration setting for you in the client configuration on all nodes. If you did not use CM, that could explain it; otherwise I would not know how that could have happened. Wilfred
06-01-2015
05:24 AM
If you are not running the yarn command as the owner of the application, you might need to add -appOwner <username> to the yarn logs command line. If you do not have access, the error you showed could be thrown; we do not distinguish between not getting access and the aggregation not having finished. Wilfred
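For example, with a placeholder application ID and owner:

yarn logs -applicationId application_1433212988000_0001 -appOwner someuser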
06-01-2015
04:35 AM
There is a known issue in releases before CDH 5.3.3 that could cause this behaviour. That issue was introduced by the fix for a similar problem in an earlier release. Both issues were intermittent and related to HA. Unless you are on CDH 5.3.3 or later, you could be hitting one of them. Wilfred
05-28-2015
03:50 AM
Sorry, this slipped through the cracks. If you have already turned off the ACL then you should be able to get the logs via the command line. Run: yarn logs -applicationId <APPLICATION ID> That should return the full log and also follow the normal process through all the proxies and checks to get the files, which should hopefully tell us in more detail what is going on. Wilfred
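For example (the application ID is a placeholder), redirecting the output to a file makes it easier to search:

yarn logs -applicationId application_1433212988000_0001 > app_logs.txt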
05-26-2015
06:11 PM
In CM & CDH 5.4 you should unset it and let Spark use the assembly that is already on the nodes. That is much faster. Wilfred
05-25-2015
06:45 PM
Why are you using SparkFiles? The path that you try to open is not defined because SparkFiles expects paths to files added through SparkContext.addFile(). Unless you have done that, you should be using sc.textFile() and passing in the URI for the file (hdfs://... or something like it), as in the sketch below. Wilfred
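A minimal PySpark sketch of the two approaches (the NameNode address and file names are placeholders):

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="example")

# Reading data straight from HDFS: no SparkFiles involved
rdd = sc.textFile("hdfs://namenode:8020/user/someuser/data.txt")

# SparkFiles only resolves files previously added with addFile()
sc.addFile("hdfs://namenode:8020/user/someuser/lookup.txt")
local_path = SparkFiles.get("lookup.txt")  # local path on the worker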
05-25-2015
06:34 PM
In a recent version (CM/CDH 5.4 as an example) the directory should look just like what you have now. We do not push the assembly separately any more: by default it uses the assembly installed on the nodes, which is faster than using the one from HDFS. The setting is still there to allow custom assemblies to be used. The setting should be entered without the hdfs:// in front, and the path will be pushed out with hdfs:// in front (CM will handle that for you). Which version of CDH and CM are you using? Wilfred
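As an illustration only (the path below is a placeholder, not necessarily your layout): you would enter something like /user/spark/share/lib/spark-assembly.jar in the setting, and the generated client configuration would then reference it as hdfs://<namenode>/user/spark/share/lib/spark-assembly.jar.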
05-25-2015
05:13 PM
1 Kudo
A1: check the HDFS Design page for details on what is stored where. The edits log and file system image are on the NN; look for the section on the persistence of file system data. For more detail on setting up the cluster, follow the Cluster Setup page. A2: if you have the disks, then having a mirrored disk will make it more resilient. Making a backup is still a good idea 😉 Wilfred
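As a related option (not a substitute for backups): dfs.namenode.name.dir accepts a comma-separated list of directories, for example file:///data/1/dfs/nn,file:///data/2/dfs/nn (placeholder paths), and the NameNode writes the fsimage and edits to each of them, which gives you redundancy on top of a mirrored disk.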
05-22-2015
08:42 AM
1 Kudo
On the master node HDFS will store things like the FSImage, the edits file and other relevant files on disk. Not huge, but it needs quick access. For the DN:
- Even or odd number of disks does not matter; it can handle what you give it.
- The number of spindles (disks) is important for the number of containers you can run on the host. We normally say about 2 containers per disk can be supported. Since you have a large number of CPU cores and a lot of memory, a larger number of disks will allow you to run more containers on the node; decreasing the number of disks means you should also lower the number of containers.
Looking at the CPU cores and disks: they seem to be nicely balanced the way you have it now with the 300GB disks. Wilfred
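As a rough worked example (the counts here are placeholders, not the exact numbers from this thread): a node with 12 data disks would, by the 2-containers-per-disk rule of thumb, support about 24 containers; dropping to 6 disks would bring that down to about 12, so the vcore and memory allocations per node would need to be scaled down to match.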
05-21-2015
11:18 PM
1 Kudo
You do not need to mirror the disks (besides the OS disks) if you are running HDFS HA. On the master nodes: use one disk just for HDFS and store all logs on the other disk. One disk dedicated to HDFS will give you the best performance, since writes to that disk are synchronous. Also make sure that the CM services store their logs and databases on the disk that does not have HDFS on it. On the DATA nodes: if you have 2 disks for the OS (mirrored) and thus have 300 GB available, I would not use the other 300 GB for apps and logs; add those 2 disks to your HDFS disks. The logs and apps can live on the OS disk on those nodes. If you are going to use Spark, make sure that you use Spark on YARN. We recommend that instead of the standalone mode: it saves resources and it has been tested far better. We do have recommendations about vcores/memory/disks in our YARN tuning documentation. Wilfred
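For example, a YARN-mode submission looks like this (the jar path is a placeholder; SparkPi is the example class that ships with Spark):

spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi /path/to/spark-examples.jar 10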