Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3454 | 01-26-2018 04:02 AM
 | 7090 | 12-22-2017 09:18 AM
 | 3538 | 12-05-2017 06:13 AM
 | 3857 | 10-16-2017 07:55 AM
 | 11231 | 10-04-2017 08:08 PM
01-13-2016
03:49 AM
1 Kudo
Have a look at this minimal get-started example: http://oryxproject.github.io/oryx/docs/endusers.html
10-22-2015
03:29 AM
Please give more information than just a command line and stack trace. Either "SimpleApp" isn't in that JAR, or it's not the fully qualified name of the class.
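As a minimal sketch of what has to line up: the class passed to spark-submit must be the fully qualified name, package included. The package "com.example" and the JAR name below are placeholders, not your actual project layout.

  // src/main/scala/com/example/SimpleApp.scala
  package com.example

  import org.apache.spark.{SparkConf, SparkContext}

  object SimpleApp {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))
      println(sc.parallelize(1 to 100).sum())  // trivial job to prove submission works
      sc.stop()
    }
  }

Then submit with the matching fully qualified name: spark-submit --class com.example.SimpleApp simple-app.jar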
10-14-2015
03:57 AM
It's tricky because, in general, the ALS implementation we are talking about is a special case compared to normal models, but it's a big special case. I think the general architecture is correct at the level it's presented, and I don't want to complicate it too much. Your feedback is a valuable contribution. Problems and bug fixes are important, but so are ways the architecture could be improved or opened up.
10-14-2015
03:48 AM
1 Kudo
Oh! On re-reading this, I realize it already consumes its own updates, actually. It took a moment of reading to recall the architecture; I should really add a note in the source code. The setUserVector/setItemVector you see is actually where it consumes updates from the batch layer. The batch layer generally does not produce updates, of course, but this is again a special case: the ALS model is so large that it has to be shipped around as a huge set of updates. This is tidy, but it also means the speed layer is listening to its own updates and processing them in exactly the same way. So, after a short delay, it hears its own updates and applies them. Even if this were not so, the speed layer would still be producing updates in response to new input immediately. The question is merely which model is used to compute the update.
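To make this concrete, here is a minimal sketch of the idea, assuming a message format and names that only mirror the discussion above; this is not the actual Oryx code. The point is that the consumer applies every update it sees, whether the batch layer or this same speed layer produced it:

  import scala.collection.concurrent.TrieMap

  // In-memory factor matrices; TrieMap allows concurrent reads during updates.
  object SpeedModelSketch {
    private val userVectors = TrieMap.empty[String, Array[Double]]
    private val itemVectors = TrieMap.empty[String, Array[Double]]

    // Called for every message on the update topic. The model does not care
    // whether the batch layer or this speed layer's own output sent it.
    def consume(kind: String, id: String, vector: Array[Double]): Unit = kind match {
      case "UP:X" => userVectors.put(id, vector)  // plays the role of setUserVector
      case "UP:Y" => itemVectors.put(id, vector)  // plays the role of setItemVector
      case _      =>                              // ignore other message types
    }
  }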
10-14-2015
02:23 AM
The short answer is that it does not internalize updates itself. It's an interesting design question. Of course, an updated model matters when answering queries in the serving layer. When the model is just being used to determine how a new input changes it, it's not necessarily important to have consumed prior updates in order to compute a good-enough update.

From an implementation perspective, this makes things significantly simpler: the model updates, in general, are intended to be computed in a distributed way with Spark. If they were also updating the model, it would be hard and slow to coordinate those in-memory updates meaningfully. The price, of course, is that the speed layer itself isn't learning in real time, even if that isn't actually nearly as important as the serving layer doing so.

Now, interestingly, for ALS the model itself is so big that the updates can't be computed in a distributed way. They are actually computed serially on the driver, on one big copy in memory. So it would be realistic to apply model updates as it goes. I'm going to file this as a to-do to think about further, since ALS is also the model where it matters more than others. It also occurs to me that, for this reason, the driver should multi-thread its computation of the updates for ALS. Also a to-do.
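As a sketch of that second to-do, assuming the per-user solves are independent, the driver could fan them out across a fixed thread pool. computeUserUpdate is a hypothetical stand-in for the real solve step, and the IDs are placeholders:

  import java.util.concurrent.Executors
  import scala.concurrent.duration.Duration
  import scala.concurrent.{Await, ExecutionContext, Future}

  def computeUserUpdate(userId: String): Unit = {
    // hypothetical: solve for this user's vector against the item matrix
  }

  // One thread per available core on the driver.
  val threads = Runtime.getRuntime.availableProcessors
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(threads))

  val updatedUsers: Seq[String] = Seq("u1", "u2")  // placeholder batch of user IDs
  Await.result(Future.traverse(updatedUsers)(u => Future(computeUserUpdate(u))), Duration.Inf)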
09-22-2015
03:47 AM
You may be using a different version of Jackson, yes. The point is to put your version in your app's classloader, which is not the same as Spark's classloader. This can still be problematic, but in theory the classloader isolation means the two versions don't interfere.
09-03-2015
03:26 PM
The timestamp is for ordering, and for determining decay of the strength factor. HDFS / Kafka do not guarantee the ordering of events, and order does matter to some extent, especially if there are 'delete' events. The timestamp also matters for figuring out how old a data point is and how much its value has decayed, if decay is enabled. You could use seconds or milliseconds, I suppose, as long as you used them consistently. However, the serving layer uses a standard millisecond timestamp, so that's probably best to emulate.
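For illustration only, a half-life-style decay on a millisecond timestamp might look like the sketch below; the formula and the half-life parameter are assumptions, not necessarily what the project implements:

  // Weight an event by its age: after one half-life, strength is halved.
  def decayedStrength(strength: Double, eventTimeMs: Long, nowMs: Long,
                      halfLifeMs: Long): Double = {
    val ageMs = math.max(0L, nowMs - eventTimeMs)
    strength * math.pow(0.5, ageMs.toDouble / halfLifeMs)
  }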
09-02-2015
03:27 PM
2 Kudos
I haven't tried those packages in a while (not since R 3.2 at least), but I know they haven't been updated in a while: https://github.com/RevolutionAnalytics/rhdfs/releases It wouldn't surprise me if they're not maintained now, especially given Revolution is probably shifting gears now that they're part of MSFT. I don't know; it's really a question for Revo or those open source projects.

Not sure if it helps, but here's a way you could use local Hadoop binaries to read from HDFS and then just pipe the result into R. Edit your ~/.Renviron to set up the Hadoop environment variables. For me on my Mac it's:

  HADOOP_CMD=/usr/local/Cellar/hadoop/2.7.1/bin/hadoop
  HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.1/
  HADOOP_CONF_DIR=/Users/srowen/Documents/Cloudera/hadoop-conf/

where hadoop-conf is a copy of the config directory from my cluster. Then in R, something like:

  data <- read.csv(pipe("hdfs dfs -cat /user/sowen/data/part-*"), header=FALSE)

You get the idea. For rmr2: I'd suggest you don't really want to run MapReduce. 🙂 It's pretty easy to trigger an R script from, say, a Spark task and parallelize a bunch of R scripts across the cluster with Spark's pipe() command (see the sketch below); that's roughly what rmr2 helps you do. You still have to set up R across the cluster. There's also SparkR on the way, but it's still pretty green.
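Here's a hedged sketch of that Spark-plus-R approach, assuming sc is a live SparkContext and that R and your script are installed on every worker; the paths are placeholders:

  // Each partition's lines are fed to the R script's stdin; whatever the
  // script writes to stdout becomes the output RDD.
  val input  = sc.textFile("hdfs:///user/sowen/data/part-*")
  val scored = input.pipe("Rscript /path/to/score.R")
  scored.saveAsTextFile("hdfs:///user/sowen/scored")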
09-02-2015
08:34 AM
1 Kudo
This kind of thing means you've got a mismatch somewhere between library versions. Are you packaging a different version of Mahout with your app? Do you have old copies of the library somewhere on a classpath?
09-02-2015
05:10 AM
How many executors are you requesting, and with how much memory and how many cores? Those are command-line options to spark-submit or spark-shell. In your YARN Resource Manager UI, have a look at how much memory and how many cores you have access to, and how much is used. Look at your YARN configuration too, in the settings regarding Resource Management, particularly the 'container max' settings controlling how many cores and how much memory YARN is willing to give to any one container. Together these will help you figure out whether there's just a mismatch between how much you're asking Spark for and how much you have made available.
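For reference, the spark-submit flags in question are --num-executors, --executor-memory, and --executor-cores, and the same knobs can be set programmatically. The values below are placeholders; they must fit under YARN's per-container maximums (yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores):

  import org.apache.spark.SparkConf

  // Placeholder values; each executor container also carries memory overhead
  // on top of spark.executor.memory, so leave headroom under the YARN max.
  val conf = new SparkConf()
    .set("spark.executor.instances", "4")  // --num-executors
    .set("spark.executor.memory", "4g")    // --executor-memory
    .set("spark.executor.cores", "2")      // --executor-cores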