About srowen

srowen · ‎10-22-2014

That's really what IDRescorer is for, yes. If you need it in distributed mode, you can reimplement the same idea by changing the code. I don't think it's really a clustering problem; you're just filtering based on clear attributes. You could also think of it as a search relevance problem, and combine the results of a recommender and a search engine in your app. No, ALS has no concept of attributes. It's a different, longer story, but you can always use 'fake' users and items corresponding to topics or labels to inject this information into the ALS model.
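To make the "reimplement the same idea" suggestion concrete, here is a minimal sketch in plain Python. It is not Mahout code: the function `recommend_filtered`, the score dictionary, and the category lookup are all hypothetical. It only mirrors the two things a rescorer does, which is to drop items that fail an attribute check and optionally adjust the score of the rest.

```python
from typing import Callable, Dict, List, Tuple


def recommend_filtered(
    scores: Dict[str, float],
    is_filtered: Callable[[str], bool],
    rescore: Callable[[str, float], float],
    how_many: int,
) -> List[Tuple[str, float]]:
    """Drop filtered items, adjust the remaining scores, return the top few."""
    kept = [
        (item, rescore(item, score))
        for item, score in scores.items()
        if not is_filtered(item)
    ]
    # Highest adjusted score first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:how_many]


# Hypothetical recommender output and item attributes, for illustration only.
raw_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.4}
category = {"a": "books", "b": "music", "c": "books", "d": "books"}

top = recommend_filtered(
    raw_scores,
    is_filtered=lambda item: category[item] != "books",  # keep only books
    rescore=lambda item, score: score,                   # leave scores unchanged
    how_many=2,
)
print(top)
```

In a distributed job the same filter and rescore step would presumably be applied to each user's candidate list before the top items are taken.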

srowen · ‎10-22-2014

How are you running it? Do any other jobs work? How about the shell? I see a possible typo in "hostnamemaster49910", as if a colon is missing. The host string is also written differently in several places.
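A quick way to catch this kind of typo is to check that a value has the `host:port` shape before handing it to a job. The sketch below is only an illustration: the helper name `looks_like_host_port` and the example host are made up, and the pattern is a loose check rather than a full hostname validator.

```python
import re

# Loose pattern: a host made of letters, digits, dots and hyphens,
# then a colon, then a numeric port.
HOST_PORT = re.compile(r"^(?P<host>[A-Za-z0-9.-]+):(?P<port>\d{1,5})$")


def looks_like_host_port(value: str) -> bool:
    """Return True if value has the form host:port with a valid port number."""
    match = HOST_PORT.match(value)
    return bool(match) and 0 < int(match.group("port")) <= 65535


print(looks_like_host_port("hostnamemaster49910"))   # string quoted in the post
print(looks_like_host_port("example-host:7077"))     # hypothetical well-formed value
```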

srowen · ‎10-22-2014

No, IDRescorer has always been a part only of the non-distributed implementation.

srowen · ‎10-15-2014

You are using standalone mode, i.e. the "Spark" service and not YARN? Check that the workers are running and healthy. Did executors register at startup? Double-check that they have the memory you think they do. If not, they may not be accepting work because they cannot allocate the memory you expect.

srowen · ‎10-15-2014

TallSkinnySVD calls RowMatrix.computeSVD, which by default decides whether to run the computation locally or not. Depending on your data, the defaults may be causing the computation to run only on the driver.

srowen · ‎10-14-2014

Yes, I don't think these examples are part of the runtime platform. You would need to bring the examples with your app. This sounds like the streaming jars somehow aren't part of your distribution. Is there anything custom about your deployment of CDH? The streaming classes should be found as part of the distribution, and that's what's missing.

srowen · ‎10-10-2014

Yes, that's all correct. Set time-threshold to 24 hours (1440 minutes) to rebuild once a day, regardless of the amount of data that has been written. Yes, the amount of data used to build the model is always increasing (unless you are manually deleting data or decaying it). It does sum up all counts for each user-item pair, so it is somewhat compacted this way.
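The compaction described above can be pictured with a short Python sketch. This is not Oryx code: the `compact` function and the sample events are hypothetical, and it only illustrates how repeated user-item records collapse into one summed count, so the input grows with distinct pairs rather than with raw events.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def compact(events: Iterable[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Sum the counts of all events that share the same (user, item) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for user, item, count in events:
        totals[(user, item)] += count
    return dict(totals)


# Hypothetical ingested events: (user, item, count).
events = [
    ("u1", "i1", 1.0),
    ("u1", "i1", 2.0),  # same pair as above, so the counts are added
    ("u2", "i1", 1.0),
    ("u1", "i2", 1.0),
]

compacted = compact(events)
print(len(events), "events ->", len(compacted), "user-item pairs")
```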

srowen · ‎10-10-2014

There is a Coursera course on Scala right now. You can still watch the videos although it started weeks ago: https://www.coursera.org/course/progfun

There are a number of examples and tutorials on the web concerning Spark. Really, take your pick after searching Google. Here's a blog post I wrote with a quick example: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

srowen · ‎10-10-2014

Scala is the native language of Spark. All else equal, it will be easiest to use Spark in Scala. However, of course, not everyone knows Scala or is using it in other projects. Using Spark from Java is only slightly less convenient. You will write more code, since anonymous classes are quite verbose in Java before Java 8. All of the Scala APIs can be called from Java too, although some look weird when accessed from Java. Most APIs have a Java-friendlier version where necessary to ease this integration. Python is probably the least easy to use since it is not JVM-based. There is a runtime overhead to translating back and forth between Spark and Python. Not all APIs are 'translated' to Python. Still, it works, and is useful if, well, you know Python and want to use it.

srowen · ‎10-05-2014

Yes, that's it. Can you delete the model while it's running? No, it would leave the instance unable to serve anything, so I hadn't seen the point of that versus just shutting down the server. No, you can't rebuild with different params without restarting, although the computation layer is the thing that rebuilds the model. The serving layer just loads what it is given. In the next version (the 2.x rewrite) it will try to find the best params automatically anyway.

Last Visited	‎02-06-2015 02:06 PM

Member Since	‎07-29-2013 08:58 AM
Posts	366
Kudos received	62

Cloudera Community

Re: CDH 5.6

Re: How to use Oryx 1 to detect spam email

Re: Spark program in eclipse

Re: Graphx in latest CDH

Re: Maturity ORYX

Re: Mahout: How to user IDRescorer in Distributed ...

Re: Spark Error Remote

Re: Mahout: How to user IDRescorer in Distributed ...

Re: spark-submit works on single node only

Re: spark-submit works on single node only

Re: org.apache.spark.examples.streaming.JavaKafkaW...

Re: Question on /ingest service

Re: Java or Scala or Python on Spark

Re: Java or Scala or Python on Spark

Re: Question on /ingest service