Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3454 | 01-26-2018 04:02 AM
 | 7090 | 12-22-2017 09:18 AM
 | 3538 | 12-05-2017 06:13 AM
 | 3857 | 10-16-2017 07:55 AM
 | 11231 | 10-04-2017 08:08 PM
01-13-2016
03:49 AM
1 Kudo
Have a look at this minimal get-started example: http://oryxproject.github.io/oryx/docs/endusers.html
10-22-2015
03:29 AM
Please give more information than just a command line and stack trace. Either "SimpleApp" isn't in that JAR, or it's not the fully qualified name of the class.
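As a minimal sketch of what has to line up: the class passed to spark-submit must be the fully qualified name, package included. The package "com.example" and the JAR name below are placeholders, not your actual project layout.

  // src/main/scala/com/example/SimpleApp.scala
  package com.example

  import org.apache.spark.{SparkConf, SparkContext}

  object SimpleApp {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))
      println(sc.parallelize(1 to 100).sum())  // trivial job to prove submission works
      sc.stop()
    }
  }

Then submit with the matching fully qualified name: spark-submit --class com.example.SimpleApp simple-app.jar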
10-14-2015
03:57 AM
It's tricky because, in general, the ALS implementation we are talking about is a special case compared to normal models, but it's a big special case. I think the general architecture is correct at the level it's presented, and I don't want to complicate it too much. Your feedback is a valuable contribution. Problems and bug fixes are important, but so are ways the architecture could be improved or opened up.
10-14-2015
03:48 AM
1 Kudo
Oh! On re-reading this, I realize it already consumes its own updates, actually. It took a moment of reading to recall the architecture; I should really add a note in the source code. The setUserVector/setItemVector you see is actually where it consumes updates from the batch layer. The batch layer generally does not produce updates, of course, but this is again a special case: the ALS model is so large that it has to be shipped around as a huge set of updates. This is tidy, but it also means the speed layer is listening to its own updates and processing them in exactly the same way. So, after a short delay, it hears its own updates and applies them. Even if this were not so, the speed layer would still be producing updates in response to new input immediately. The question is merely which model is used to compute the update.
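To make this concrete, here is a minimal sketch of the idea, assuming a message format and names that only mirror the discussion above; this is not the actual Oryx code. The point is that the consumer applies every update it sees, whether the batch layer or this same speed layer produced it:

  import scala.collection.concurrent.TrieMap

  // In-memory factor matrices; TrieMap allows concurrent reads during updates.
  object SpeedModelSketch {
    private val userVectors = TrieMap.empty[String, Array[Double]]
    private val itemVectors = TrieMap.empty[String, Array[Double]]

    // Called for every message on the update topic. The model does not care
    // whether the batch layer or this speed layer's own output sent it.
    def consume(kind: String, id: String, vector: Array[Double]): Unit = kind match {
      case "UP:X" => userVectors.put(id, vector)  // plays the role of setUserVector
      case "UP:Y" => itemVectors.put(id, vector)  // plays the role of setItemVector
      case _      =>                              // ignore other message types
    }
  }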
10-14-2015
02:23 AM
The short answer is that it does not internalize updates itself. It's an interesting design question. Of course, an updated model matters when answering queries in the serving layer. When the model is just being used to determine how a new input changes it, it's not necessarily important to have consumed prior updates in order to compute a good-enough update.

From an implementation perspective, this makes things significantly simpler: the model updates, in general, are intended to be computed in a distributed way with Spark. If they were also updating the model, it would be hard and slow to coordinate those in-memory updates meaningfully. The price, of course, is that the speed layer itself isn't learning in real time, even if that isn't actually nearly as important as the serving layer doing so.

Now, interestingly, for ALS the model itself is so big that the updates can't be computed in a distributed way. They are actually computed serially on the driver, on one big copy in memory. So it would be realistic to apply model updates as it goes. I'm going to file this as a to-do to think about further, since ALS is also the model where it matters more than others. It also occurs to me that, for this reason, the driver should multi-thread its computation of the updates for ALS. Also a to-do.
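As a sketch of that second to-do, assuming the per-user solves are independent, the driver could fan them out across a fixed thread pool. computeUserUpdate is a hypothetical stand-in for the real solve step, and the IDs are placeholders:

  import java.util.concurrent.Executors
  import scala.concurrent.duration.Duration
  import scala.concurrent.{Await, ExecutionContext, Future}

  def computeUserUpdate(userId: String): Unit = {
    // hypothetical: solve for this user's vector against the item matrix
  }

  // One thread per available core on the driver.
  val threads = Runtime.getRuntime.availableProcessors
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(threads))

  val updatedUsers: Seq[String] = Seq("u1", "u2")  // placeholder batch of user IDs
  Await.result(Future.traverse(updatedUsers)(u => Future(computeUserUpdate(u))), Duration.Inf)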
09-22-2015
03:47 AM
You may be using a different version of Jackson, yes. The point is to put your version in your app's classloader, which is not the same as Spark's classloader. This can still be problematic, but in theory the classloader isolation means the two versions don't interfere.
09-03-2015
03:26 PM
The timestamp is for ordering, and for determining decay of the strength factor. HDFS / Kafka do not guarantee the ordering of events, and order does matter to some extent, especially if there are 'delete' events. The timestamp also matters for figuring out how old a data point is and how much its value has decayed, if decay is enabled. You could use seconds or milliseconds, I suppose, as long as you used them consistently. However, the serving layer uses a standard millisecond timestamp, so that's probably best to emulate.
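For illustration only, a half-life-style decay on a millisecond timestamp might look like the sketch below; the formula and the half-life parameter are assumptions, not necessarily what the project implements:

  // Weight an event by its age: after one half-life, strength is halved.
  def decayedStrength(strength: Double, eventTimeMs: Long, nowMs: Long,
                      halfLifeMs: Long): Double = {
    val ageMs = math.max(0L, nowMs - eventTimeMs)
    strength * math.pow(0.5, ageMs.toDouble / halfLifeMs)
  }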
09-02-2015
03:27 PM
2 Kudos
I haven't tried those packages in a while (not since R 3.2 at least), but I know they haven't been updated in a while: https://github.com/RevolutionAnalytics/rhdfs/releases It wouldn't surprise me if they're not maintained now, especially given Revolution is probably shifting gears now that they're part of MSFT. I don't know; it's really a question for Revo or those open source projects.

Not sure if it helps, but here's a way you could use local Hadoop binaries to read from HDFS and then just pipe the result into R. Edit your ~/.Renviron to set up the Hadoop environment variables. For me on my Mac it's:

  HADOOP_CMD=/usr/local/Cellar/hadoop/2.7.1/bin/hadoop
  HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.1/
  HADOOP_CONF_DIR=/Users/srowen/Documents/Cloudera/hadoop-conf/

where hadoop-conf is a copy of the config directory from my cluster. Then in R, something like:

  data <- read.csv(pipe("hdfs dfs -cat /user/sowen/data/part-*"), header=FALSE)

You get the idea. For rmr2: I'd suggest you don't really want to run MapReduce. 🙂 It's pretty easy to trigger an R script from, say, a Spark task and parallelize a bunch of R scripts across the cluster with Spark's pipe() command (see the sketch below); that's roughly what rmr2 helps you do. You still have to set up R across the cluster. There's also SparkR on the way, but it's still pretty green.
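Here's a hedged sketch of that Spark-plus-R approach, assuming sc is a live SparkContext and that R and your script are installed on every worker; the paths are placeholders:

  // Each partition's lines are fed to the R script's stdin; whatever the
  // script writes to stdout becomes the output RDD.
  val input  = sc.textFile("hdfs:///user/sowen/data/part-*")
  val scored = input.pipe("Rscript /path/to/score.R")
  scored.saveAsTextFile("hdfs:///user/sowen/scored")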
09-02-2015
08:34 AM
1 Kudo
This kind of thing means you've got a mismatch somewhere between library versions. Are you packaging a different version of Mahout with your app? Do you have old copies of the library somewhere on a classpath?
09-02-2015
05:10 AM
How many executors are you requesting, and with how much memory and how many cores? Those are command-line options to spark-submit or spark-shell. In your YARN Resource Manager UI, have a look at how much memory and how many cores you have access to, and how much is used. Look at your YARN configuration too, in the settings regarding Resource Management, particularly the 'container max' settings controlling how many cores and how much memory YARN is willing to give to any one container. Together these will help you figure out whether there's just a mismatch between how much you're asking Spark for and how much you have made available.
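For reference, the spark-submit flags in question are --num-executors, --executor-memory, and --executor-cores, and the same knobs can be set programmatically. The values below are placeholders; they must fit under YARN's per-container maximums (yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores):

  import org.apache.spark.SparkConf

  // Placeholder values; each executor container also carries memory overhead
  // on top of spark.executor.memory, so leave headroom under the YARN max.
  val conf = new SparkConf()
    .set("spark.executor.instances", "4")  // --num-executors
    .set("spark.executor.memory", "4g")    // --executor-memory
    .set("spark.executor.cores", "2")      // --executor-cores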