Member since: 01-16-2014
Posts: 336
Kudos Received: 43
Solutions: 31

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1795 | 12-20-2017 08:26 PM |
 | 1820 | 03-09-2017 03:47 PM |
 | 1612 | 11-18-2016 09:00 AM |
 | 2331 | 05-18-2016 08:29 PM |
 | 2049 | 02-29-2016 01:14 AM |
06-09-2015
12:52 AM
Part of that we have already included via configuration when you run on YARN. There are still known issues on the Spark side which make recovery not straightforward and not robust in all failure cases. AM failures are notoriously hard to recover from, but I would expect a direct kill of the AM container to be seen as a failure and picked up by YARN for a restart. How do you kill the container (kill -9 of the JVM)? Wilfred
06-09-2015
12:44 AM
The scheduler information is important: all that configuration is part of the RM configuration (yarn-site.xml) and the config files for the different schedulers, like the fair-scheduler.xml file. The logs from the RM or the AM for the application will also be needed to see what is going on. Wilfred
06-09-2015
12:41 AM
Good to hear that this has been fixed! We have seen this issue in early CDH 5 releases, but it was fixed in CM/CDH 5.2 and later. Cloudera Manager should have deployed that configuration setting for you in the client config on all nodes. If you did not use CM then that could explain it; otherwise I would not know how that could have happened. Wilfred
06-05-2015
12:21 AM
Which scheduler is used? How is it configured (resources etc.)? What settings do you have for the (AM) containers? Is there a log for the AM container which shows more? Wilfred
06-05-2015
12:17 AM
It does depend on how the driver dies. As you said, the AM is retried based on the settings under certain circumstances. You seem to have stumbled onto a case which is not handled correctly. However, we would need to know a bit more: why did the driver fail? Do you have a log of the container that ran the driver so we can see what the cause of the driver failure was? Wilfred
06-02-2015
10:25 PM
I completely overlooked the fact that this is not the FairScheduler but the CapacityScheduler. The change that was made went into the overarching code for both, and we have seen the fix work for the FairScheduler: it recovers from the issue with the standard config. We recommend using the FairScheduler for a CDH release since we do far more testing, also at scale, and development work on it. That said: can you show the lines (20-25 at least) just after the error was thrown? That should shed some more light on what the scheduler is doing. Wilfred
06-02-2015
08:53 PM
In a cluster which is kerberised there is no SIMPLE authentication. Make sure that you have run kinit before you run the application. Second thing to check: in your application you need to do the right thing and pass on either the TOKEN or a KERBEROS ticket. When the job is submitted, and you have done a kinit, you will have a TOKEN to access HDFS; you would need to pass that on, or the KERBEROS ticket. You will need to handle this in your code. I cannot see exactly what you are doing at that point in the startup of your code, but any HDFS access will require a TOKEN or KERBEROS ticket. Cheers, Wilfred
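A minimal sketch of what handling this in code can look like when a keytab is available; the principal, keytab path and HDFS path below are placeholders, and the exact approach depends on how the application obtains its credentials:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object SecureHdfsAccess {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // With Kerberos enabled the client configuration normally already sets this.
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)

    // Log in explicitly from a keytab; principal and keytab path are placeholders.
    UserGroupInformation.loginUserFromKeytab(
      "myuser@EXAMPLE.COM", "/path/to/myuser.keytab")

    // Any HDFS access after the login uses the Kerberos credentials.
    val fs = FileSystem.get(conf)
    fs.listStatus(new Path("/user/myuser")).foreach(s => println(s.getPath))
  }
}
```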
06-02-2015
07:45 PM
This works for me in 2.6 and 2.7. It looks like you have another problem which is causing this, not related to the Python version. Based on the message about the resources I would look into the environment and make sure that all variables and paths are set to the same values after you switch Python versions. You can look at the PySpark master UI and check that you have executors etc. Wilfred
06-02-2015
07:06 PM
You can start multiple PySpark shells on one host under the same user name. The shell, just as with the Scala shell, will find an unused port and allow you to do what is needed. There is no limitation on the PySpark side that you need to work around. I am not sure how the notebook needs to be configured to allow multiple to run at once. Wilfred
06-01-2015
05:24 AM
If you are not running the yarn command as the owner of the application you might need to add -appOwner <username> to the yarn logs command line. If you do not have access, the error you showed could be thrown. We do not distinguish between not getting access and the aggregation not having finished. Wilfred
06-01-2015
04:49 AM
The reservations for memory and vcores are logged in the normal RM log. The RM UI also shows them in the main UI at the top, under the cluster metrics. There are two values: Memory Reserved and VCores Reserved. Wilfred
06-01-2015
04:35 AM
There is a known issue in releases before CDH 5.3.3 which could cause this to show up. That issue was introduced by the fix for a similar issue in an earlier release. Both issues were intermittent and related to HA. Unless you are on CDH 5.3.3 or later you could be seeing one of those. Wilfred
05-28-2015
04:40 AM
There have been API changes which make it impossible to compile certain parts of Spark against the new version of Hive. All the parts that can work are included in the Spark version that comes with CDH 5.4. Wilfred
05-28-2015
04:26 AM
Hive on Spark uses Spark as the execution framework, in the same way that MapReduce is an execution framework. Spark SQL uses Hive dependencies, but that side is not supported. Hive in CDH is newer than the Hive that Spark is designed against, and there are parts that do not work against the new release of Hive. Some parts will work, others will not. Wilfred
05-28-2015
04:13 AM
You need to provide a little more detail: standalone or on YARN, the command line and the environment settings would be a good start. The error points to something other than a Python version issue. Wilfred
05-28-2015
03:50 AM
Sorry, this slipped through the cracks. If you have already turned off the ACL then you should be able to get the logs via the command line:

yarn logs -applicationId <APPLICATION ID>

That should return the full log and also follow the normal process through all the proxies and checks to get the files, and we should hopefully be able to tell in more detail what is going on. Wilfred
05-27-2015
10:11 PM
The fact that the two traces are different means that there is still something going on inside the RM; not everything is locked up. I assume that you have looked for threads that have a Thread.State of BLOCKED or WAITING. TIMED_WAITING threads are OK, nothing wrong with those. BLOCKED or WAITING means that something more is going on. BLOCKED is the really bad one. WAITING could be OK for worker threads that are waiting for work to be released from a queue or something like it; they normally use a monitor for that. The larger stack trace snippet that you uploaded does not show anything wrong. The threads that are in WAITING are fine (at least the ones I can see there). Based on this information I cannot tell why things are hanging. I have not seen the issue in my local setup and cannot reproduce it either. Wilfred
05-26-2015
06:11 PM
In CM & CDH 5.4 you should unset it and let it use the one that is there on the nodes. Much faster. Wilfred
05-25-2015
07:18 PM
CDH 5.4 has a patched Spark 1.3 and is built on a patched Hadoop 2.6. Why not use that? The fact that you have this issue shows that you have old files or pointers to old configuration hanging around. You should also not build your own Spark but use the Spark that comes with CDH (which is the same version). If you use CM all the configuration you need is created for you and there is no need to do trial and error to find what needs to be set. Otherwise make sure that you have all Spark and YARN configuration on the host that executes the action (the Oozie host). BTW: oozie.service.SparkConfigurationService.spark.configurations is a comma-separated list of "key=value" pairs. Setting the master in the default conf also seems a bit strange since you must have it in the XML. Use the XML from the action to set as much as possible (see the Spark action docs); relying on the defaults can lead to strange behaviour if you use it outside Oozie. A good start would also be to run the Pi example that comes with Spark on the Oozie host, to check that all configuration is correct before building the Oozie action. Wilfred
05-25-2015
06:45 PM
Why are you using SparkFiles? The path that you try to open is not defined because SparkFiles expects paths to files added through SparkContext.addFile(). Unless you have done that, you should be using sc.textFile() and pass in the URI for the file (hdfs://... or something like it). Wilfred
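A minimal sketch of the two patterns being contrasted here; the file names and paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object ReadFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-file-example"))

    // Pattern 1: read a file as an RDD straight from HDFS; no SparkFiles involved.
    val lines = sc.textFile("hdfs:///user/example/input.txt")
    println(lines.count())

    // Pattern 2: SparkFiles only resolves files that were first distributed
    // with SparkContext.addFile(); the resolved path is a local path on each node.
    sc.addFile("hdfs:///user/example/lookup.csv")
    val localPath = SparkFiles.get("lookup.csv")
    println(s"lookup.csv was copied to: $localPath")

    sc.stop()
  }
}
```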
05-25-2015
06:34 PM
In a recent version (CM/CDH 5.4 as an example) the directory should just look like what you have now. We do not push the assembly separately any more. By default it uses the assembly installed on the nodes, which is faster than using the one from HDFS. The setting is still there to allow custom assemblies to be used. The setting should be entered without the HDFS prefix in front, and the path will be pushed out with the HDFS prefix added (CM will handle that for you). Which version of CDH and CM are you using? Wilfred
05-25-2015
05:41 PM
I should have been clearer in my request: for a Java process we do not use the normal stack dump utility. We use "jstack", which comes with the JVM. It will show a nicely formatted dump if you run it as the user that owns the process. So for the RM I would run:

su - yarn
<path to java bin>/jps | grep ResourceManager
<path to java bin>/jstack <pid from RM>

That stack trace will show exactly what each thread is doing and what it is waiting on. Example from my RM:

2015-05-26 00:39:09
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode):

"807502238@qtp-2034297567-6432" daemon prio=10 tid=0x00007fbfec163000 nid=0x2edc in Object.wait() [0x00007fbfd942e000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000dbe2bd80> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
        - locked <0x00000000dbe2bd80> (a org.mortbay.thread.QueuedThreadPool$PoolThread)

"Attach Listener" daemon prio=10 tid=0x00007fbffc196000 nid=0x2d9f waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

It also shows the system threads all the way at the end of the dump; those are less interesting than the top threads. Wilfred
05-25-2015
05:13 PM
1 Kudo
A1: check the HDFS Design page for details on what is stored where. The edits log and file system image are on the NN. Look for the section on persistence of file system data. For more detail on setting up the cluster follow Cluster Setup. A2: if you have the disks then having a mirrored disk will make it more resilient. Making a backup is still a good idea 😉 Wilfred
05-22-2015
09:00 AM
That depends on how big the stack trace is, but normally using the code insert (the button with <> on it) you can add it directly here in the message. You can also put them somewhere public (like a gist) and link them here. Wilfred
05-22-2015
08:42 AM
1 Kudo
On the master node HDFS will store things like the FSImage, the edits file and other relevant files on disk. Not huge, but it needs quick access.

For the DN:
- Even or odd does not matter, it can handle what you give it.
- The number of spindles (disks) is important for the number of containers you can run on the host. We normally say about 2 containers per disk can be supported.

Since you have a large number of CPU cores and a lot of memory, having a larger number of disks will allow you to run more containers on the node. Decreasing the number of disks means you should also lower the number of containers. Looking at the CPU cores and disks: they seem to be nicely balanced the way you have it now with the 300GB disks. Wilfred
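As a rough worked example of that rule of thumb, assuming a hypothetical data node with 12 data disks: about 12 × 2 = 24 containers could be supported, provided memory and vcores do not become the limiting factor first.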
05-21-2015
11:18 PM
1 Kudo
You do not need to mirror the disks (besides the OS) if you are running HDFS HA. On the master nodes: get one disk just for HDFS and you can store all logs on the other disk. One disk for HDFS will get you the best performance since writes are synchronous to that disk. Also make sure that the CM services store logs and DBs on the disk that does not have HDFS on it. On the data nodes: if you have 2 disks for the OS (mirrored) and thus have 300 GB available, I would not use the other 300 GB for apps and logs. Add those 2 disks to your HDFS disks. The logs and apps can live on the OS disk on those nodes. If you are going to use Spark, make sure that you use Spark on YARN. We recommend that instead of the standalone mode: it saves resources and it has been tested far better. We do have recommendations about vcores/memory/disks in our YARN tuning documentation. Wilfred
05-21-2015
10:59 PM
1 Kudo
There has been a change in the indirect dependencies that get added by Spark. Spark itself has no dependency on HBase and thus will not have any HBase jars on its path by default. The Hive integration does, however, and that used to give you all the classes to run an HBase application on Spark without the need to do anything. Hive and HBase have changed and this is not the case any more. That is the cause of this "breakage". However, an application should not have relied on this indirect dependency loading of jars, and you need to add whatever you need to the classpath yourself. This is the workaround for a customer to get this working (parcel based distribution using CM): add the HBase jars to the executor classpath via the following steps:
- log in to Cloudera Manager
- go to the Spark on YARN service
- go to the Configuration tab
- type "defaults" in the search box
- select Gateway in the scope
- add the entry: spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar
- save the change and you will see an icon appear to deploy client configs (can take 30 sec to show)
- deploy the client config
- run the Spark application accessing HBase by executing the following: spark-submit --master yarn-cluster --driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar ....
If you are not using CM you can make the changes manually as long as you make sure that the htrace jar (that specific version) is on the path. Wilfred
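A minimal sketch of the kind of Spark-on-HBase access that the classpath fix above enables; the table name is a placeholder and exact package locations can differ between HBase releases:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-on-spark-sketch"))

    // hbase-site.xml is picked up from the classpath (/etc/hbase/conf above).
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

    // Scan the table as an RDD of (row key, Result) pairs.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"row count: ${rows.count()}")
    sc.stop()
  }
}
```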
05-20-2015
08:39 PM
That is a version which should not have a problem like this. If you can reproduce this each and every time, can you send me a set of stack traces (3, taken 5 seconds apart) so we can see what the state of the RM is? Also include the yarn-site.xml with that. We should not lock up when you request more than the allowed size, and I have not seen it happen for me. Wilfred
05-17-2015
04:16 PM
Based on the fact that the Pi example works, YARN works and it is something that is being done in the application code. In a secured cluster the container does not run as the same user as the NodeManager. Normally the NodeManager runs as the user yarn and the container as the user who started the application. The configuration directory that you are trying to access is the configuration directory of the service, not of the container. The container gets the configuration passed in. You should not be accessing the service configuration as it most likely differs from the container configuration. There are a lot of settings that are not relevant for a service and thus are not set, or are set to the Hadoop default. If they exist they are normally ignored and read from the configuration that was created by the application on submission. The application should never access those files. Wilfred
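A minimal sketch of the pattern described above: use the configuration the framework hands to the container instead of reading the service configuration directory from disk (the property printed is only an illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContainerConfigSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("container-config-sketch"))

    // Avoid loading yarn-site.xml or core-site.xml from the service's own
    // configuration directory on the node; that belongs to the NodeManager,
    // not to this container.

    // Instead, use the Hadoop configuration Spark built from what was submitted
    // with the application; it already contains the relevant client settings.
    val hadoopConf = sc.hadoopConfiguration
    println("fs.defaultFS = " + hadoopConf.get("fs.defaultFS"))

    sc.stop()
  }
}
```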
05-15-2015
12:58 AM
Please always provide the CDH version and, if you use it, the CM version. I can see that you use CM based on that path. Can you tell me if you enabled Kerberos through the wizard or manually? If you run a simple Pi example job, does it work or does that fail also? For this failure can you provide a full stack trace? There is information missing and I would like to see where the exception is thrown from. Wilfred