Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6191 | 03-09-2016 01:21 AM |
| | 5032 | 03-07-2016 01:52 AM |
| | 15076 | 02-29-2016 04:40 AM |
| | 4742 | 02-22-2016 03:08 PM |
| | 5750 | 01-19-2016 02:13 PM |
09-05-2014
03:24 AM
I don't follow... workers are irrelevant to the problem I am suggesting. You could have 1000 workers with 1TB of memory and still fail if you try to copy 250MB into memory on your driver process and the driver does not have enough memory. Spark can certainly hold the data in memory on the workers, but that is not what your code asks it to do. Why are you calling collect()?
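For illustration, here is a minimal sketch (the paths and names are hypothetical, not from the original thread) of the difference between pulling results back to the driver with collect() and keeping the work on the cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CollectExample"))

// The data stays partitioned across the workers:
val rdd = sc.textFile("hdfs:///data/input")

// collect() copies every element into the driver JVM's heap; on a small driver
// this is exactly where an OutOfMemoryError would appear:
// val everything = rdd.collect()

// Alternatives that never materialize the whole data set on the driver:
val firstFew = rdd.take(10)                   // bring back only a handful of elements
rdd.saveAsTextFile("hdfs:///data/output")     // or write results out from the executors
```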
09-05-2014
02:58 AM
Yes, but your call to collect() says "please copy all of the results into memory on the driver". I believe that's what is running out of memory. I don't see any evidence that the workers have a problem.
09-05-2014
12:01 AM
In the first example, you're collecting the entire data set into memory on your driver process. I don't know how much memory you gave it, but if your machines have 512MB of memory (total?), then a 250MB data set, after accounting for Java overhead, probably blows through all of that, so an OOM is expected.

The second looks like some kind of error during execution of the Spark task. It's not clear from that log what the error is, just that a task failed repeatedly. It could be a Spark SQL problem, but you'd have to look at the task log to determine why it did not work. (There is the same potential problem with collecting all the data to the driver, but it didn't get that far.)

The third instance has a similar problem, but the error you see is an HDFS error. It sounds like the datanodes are not working. Are these nodes trying to run Spark and HDFS in 512MB, or do you mean Spark has 512MB? I'd check the health of HDFS.
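If driver memory is the limit, one common remedy (an illustrative command only; the class and jar names are hypothetical) is to give the driver a larger heap when submitting:

```bash
# Give the driver more heap than the data you intend to collect, plus room for Java overhead.
spark-submit --driver-memory 1g --class com.example.MyApp myapp.jar
```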
09-01-2014
02:27 AM
See my message above about modifying roles. You would just set an additional host to be a worker. I'm assuming you are using standalone mode.
08-31-2014
01:21 AM
The default is that you manually trigger model builds. But you can configure it to build after a certain amount of time has elapsed, or a certain number of data points have been written. See model.time-threshold and model.data-threshold. Yes, all data points cause in-memory model updates no matter how they arrive.
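For reference, a sketch of what those overrides could look like in a config file (the values and units are illustrative only; check the Oryx configuration reference for the exact semantics and defaults):

```
# Rebuild the model automatically once either threshold is crossed:
model.time-threshold = 60      # after this much time has elapsed
model.data-threshold = 100000  # or after this many new data points have been written
```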
08-29-2014
03:54 PM
This is a conflict between the version of Guava that Spark uses and the version used by Hadoop. How are you packaging your app? And can you run with spark-submit? That tends to take care of this conflict.
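One packaging pattern that often helps keep a conflicting Guava out of the application jar (a sketch only; the project name and Spark version are assumptions) is to mark Spark as a provided dependency and let spark-submit supply it at runtime:

```scala
// build.sbt sketch: Spark is "provided" by the cluster, so its transitive
// dependencies (including its Guava) are not bundled into the application jar.
name := "my-spark-app"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2" % "provided"
```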
08-29-2014
04:00 AM
It doesn't make sense to put two workers on one host. One worker can host many executors, and an executor can even run many tasks in parallel. Your default parallelism will be a function of the number of cores, which should be much more than 1. As long as your input has more than one partition you'll get parallel execution. If not, use repartition() to make more partitions.
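A quick sketch of that last point (the RDD, path, and partition count are hypothetical):

```scala
val input = sc.textFile("hdfs:///data/input")
println(input.partitions.size)       // how many partitions, and hence parallel tasks, you can get

// If the input arrives as a single partition, spread it out explicitly:
val spread = input.repartition(8)
```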
08-28-2014
09:12 AM
1 Kudo
Evaluating recommender systems is always a bit hard to do in the lab, since there's no real way to know which of all of the other items are actually the best recommendations. The best you can do is hold out some items the user has interacted with and see whether they rank highly in the recommendations later, after building a model on the rest of the data.

The ALS implementation does do this for you automatically. You will see an evaluation score printed in the logs as an "AUC" score, or area under the curve. 1.0 is perfect; 0.5 is what random guessing would get. So you can get some sense there. Above 0.7 or so is probably "OK", but it depends. That's about all the current version offers.

In theory this helps you tune the model parameters, since you can try a bunch of values of lambda, features, etc. and pick the best. But that's very manual. The next 2.0 version being architected now is going to go much further, and build many models at once and pick the best one. So it will be able to pick hyperparameters automatically.
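To make the hold-out idea concrete, here is a conceptual sketch (all names are hypothetical; this is not the actual implementation). AUC here is the probability that a held-out item the user really interacted with outranks a random item the user never touched:

```scala
// score: the model's predicted score for each item, for one user
// heldOut: items hidden from training that the user actually interacted with
// others: items the user never interacted with
def aucForUser(score: String => Double,
               heldOut: Set[String],
               others: Set[String]): Double = {
  val pairs = for (h <- heldOut.toSeq; o <- others.toSeq) yield {
    if (score(h) > score(o)) 1.0
    else if (score(h) == score(o)) 0.5
    else 0.0
  }
  pairs.sum / pairs.size   // 1.0 = perfect ranking, 0.5 = no better than random
}
```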
08-13-2014
08:26 AM
1 Kudo
You should probably decrease the number of partitions. Fewer partitions means fewer workers, but evidently you need nowhere near 30 workers to keep up. This reduces the number of files per interval. You can use a longer interval too. Finally, you can post-process these files to do something else with them, including combining them and deleting the originals if desired.
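As a sketch of the first two suggestions (the interval length, partition count, and paths are all illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FewerOutputFiles")
val ssc = new StreamingContext(conf, Seconds(60))   // longer batch interval => fewer output intervals

ssc.textFileStream("hdfs:///data/incoming")
   .repartition(4)                                  // fewer partitions => fewer part-files per interval
   .saveAsTextFiles("hdfs:///data/out/batch")

ssc.start()
ssc.awaitTermination()
```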
08-05-2014
03:42 AM
2 Kudos
Why? In a kerberized environment, you need to integrate with Kerberos to access resources, and the Spark project hasn't implemented anything like that itself. YARN does work with Kerberos, so Spark can work with Kerberos by leveraging YARN. Maybe part of the answer is: why would it be necessary, if it already works through YARN?