Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6191 | 03-09-2016 01:21 AM |
| | 5032 | 03-07-2016 01:52 AM |
| | 15076 | 02-29-2016 04:40 AM |
| | 4742 | 02-22-2016 03:08 PM |
| | 5750 | 01-19-2016 02:13 PM |
09-05-2014
03:24 AM
I don't follow... workers are irrelevant to the problem I am suggesting. You could have 1000 workers with 1TB of memory and still fail if you try to copy 250MB into memory on your driver process and the driver does not have enough memory. Spark can certainly hold the data in memory on the workers, but that is not what your code asks it to do. Why are you calling collect()?
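For illustration, here is a minimal sketch (the paths and names are hypothetical, not from the original thread) of the difference between pulling results back to the driver with collect() and keeping the work on the cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CollectExample"))

// The data stays partitioned across the workers:
val rdd = sc.textFile("hdfs:///data/input")

// collect() copies every element into the driver JVM's heap; on a small driver
// this is exactly where an OutOfMemoryError would appear:
// val everything = rdd.collect()

// Alternatives that never materialize the whole data set on the driver:
val firstFew = rdd.take(10)                   // bring back only a handful of elements
rdd.saveAsTextFile("hdfs:///data/output")     // or write results out from the executors
```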
09-05-2014
02:58 AM
Yes, but your call to collect() says "please copy all of the results into memory on the driver". I believe that's what is running out of memory. I don't see any evidence that the workers have a problem.
09-05-2014
12:01 AM
In the first example, you're collecting the entire data set into memory on your driver process. I don't know how much memory you gave it, but if your machines have 512MB of memory (total?), then a 250MB data set, after accounting for Java overhead, probably blows through all of that, so an OOM is expected.

The second looks like some kind of error during execution of the Spark task. It's not clear from that log what the error is, just that a task failed repeatedly. It could be a Spark SQL problem, but you'd have to look at the task log to determine why it did not work. (There is the same potential problem with collecting all the data to the driver, but it didn't get that far.)

The third instance has a similar problem, but the error you see is an HDFS error. It sounds like the datanodes are not working. Are these nodes trying to run Spark and HDFS in 512MB, or do you mean Spark has 512MB? I'd check the health of HDFS.
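If driver memory is the limit, one common remedy (an illustrative command only; the class and jar names are hypothetical) is to give the driver a larger heap when submitting:

```bash
# Give the driver more heap than the data you intend to collect, plus room for Java overhead.
spark-submit --driver-memory 1g --class com.example.MyApp myapp.jar
```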
09-01-2014
02:27 AM
See my message above about modifying roles. You would just set an additional host to be a worker. I'm assuming you are using standalone mode.
08-31-2014
01:21 AM
The default is that you manually trigger model builds. But you can configure it to build after a certain amount of time has elapsed, or a certain number of data points have been written. See model.time-threshold and model.data-threshold. Yes, all data points cause in-memory model updates no matter how they arrive.
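For reference, a sketch of what those overrides could look like in a config file (the values and units are illustrative only; check the Oryx configuration reference for the exact semantics and defaults):

```
# Rebuild the model automatically once either threshold is crossed:
model.time-threshold = 60      # after this much time has elapsed
model.data-threshold = 100000  # or after this many new data points have been written
```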
08-29-2014
03:54 PM
This is a conflict between the version of Guava that Spark uses and the version used by Hadoop. How are you packaging your app? And can you run with spark-submit? That tends to take care of this conflict.
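One packaging pattern that often helps keep a conflicting Guava out of the application jar (a sketch only; the project name and Spark version are assumptions) is to mark Spark as a provided dependency and let spark-submit supply it at runtime:

```scala
// build.sbt sketch: Spark is "provided" by the cluster, so its transitive
// dependencies (including its Guava) are not bundled into the application jar.
name := "my-spark-app"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2" % "provided"
```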
08-29-2014
04:00 AM
It doesn't make sense to put two workers on one host. One worker can host many executors, and an executor can even run many tasks in parallel. Your default parallelism will be a function of the number of cores, which should be much more than 1. As long as your input has more than one partition you'll get parallel execution. If not, use repartition() to make more partitions.
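A quick sketch of that last point (the RDD, path, and partition count are hypothetical):

```scala
val input = sc.textFile("hdfs:///data/input")
println(input.partitions.size)       // how many partitions, and hence parallel tasks, you can get

// If the input arrives as a single partition, spread it out explicitly:
val spread = input.repartition(8)
```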
08-28-2014
09:12 AM
1 Kudo
Evaluating recommender systems is always a bit hard to do in the lab, since there's no real way to know which of all of the other items are actually the best recommendations. The best you can do is hold out some items the user has interacted with and see whether they rank highly in the recommendations later, after building a model on the rest of the data.

The ALS implementation does do this for you automatically. You will see an evaluation score printed in the logs as an "AUC" score, or area under the curve. 1.0 is perfect; 0.5 is what random guessing would get. So you can get some sense there. Above 0.7 or so is probably "OK", but it depends. That's about all the current version offers.

In theory this helps you tune the model parameters, since you can try a bunch of values of lambda, features, etc. and pick the best. But that's very manual. The next 2.0 version being architected now is going to go much further, and build many models at once and pick the best one. So it will be able to pick hyperparameters automatically.
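To make the hold-out idea concrete, here is a conceptual sketch (all names are hypothetical; this is not the actual implementation). AUC here is the probability that a held-out item the user really interacted with outranks a random item the user never touched:

```scala
// score: the model's predicted score for each item, for one user
// heldOut: items hidden from training that the user actually interacted with
// others: items the user never interacted with
def aucForUser(score: String => Double,
               heldOut: Set[String],
               others: Set[String]): Double = {
  val pairs = for (h <- heldOut.toSeq; o <- others.toSeq) yield {
    if (score(h) > score(o)) 1.0
    else if (score(h) == score(o)) 0.5
    else 0.0
  }
  pairs.sum / pairs.size   // 1.0 = perfect ranking, 0.5 = no better than random
}
```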
08-13-2014
08:26 AM
1 Kudo
You should probably decrease the number of partitions. Fewer partitions means fewer workers, but evidently you need nowhere near 30 workers to keep up. This reduces the number of files per interval. You can use a longer interval too. Finally, you can post-process these files to do something else with them, including combining them and deleting the originals if desired.
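As a sketch of the first two suggestions (the interval length, partition count, and paths are all illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FewerOutputFiles")
val ssc = new StreamingContext(conf, Seconds(60))   // longer batch interval => fewer output intervals

ssc.textFileStream("hdfs:///data/incoming")
   .repartition(4)                                  // fewer partitions => fewer part-files per interval
   .saveAsTextFiles("hdfs:///data/out/batch")

ssc.start()
ssc.awaitTermination()
```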
08-05-2014
03:42 AM
2 Kudos
Why? In a kerberized environment, you need to integrate with Kerberos to access resources, and the Spark project hasn't implemented anything like that itself. YARN does work with Kerberos, so Spark can work with Kerberos by leveraging YARN. Maybe part of the answer is: why would it be necessary, if it already works through YARN?