Member since 07-29-2013
366 Posts
69 Kudos Received
71 Solutions
My Accepted Solutions
Views | Posted
---|---
6199 | 03-09-2016 01:21 AM
5034 | 03-07-2016 01:52 AM
15084 | 02-29-2016 04:40 AM
4747 | 02-22-2016 03:08 PM
5750 | 01-19-2016 02:13 PM
10-22-2014
05:09 AM
That's really what IDRescorer is for, yes. If you need it in distributed mode, you can reimplement the same idea by changing the code. I don't think it's really a clustering problem; you're just filtering based on clear attributes. You could also think of it as a search relevance problem, and combine the results of a recommender and a search engine in your app. No, ALS has no concept of attributes. It's a different, longer story, but you can always use 'fake' users and items corresponding to topics or labels to inject this information into the ALS model.
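To make the "same idea" concrete, here is a minimal sketch in plain Python of what a rescorer does: it can drop a candidate item entirely, or adjust its score, based on item attributes. It does not use Mahout; the class name, the `ITEM_CATEGORY` table, and the `apply_rescorer` helper are all hypothetical, chosen only to mirror the filter-and-rescore idea.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical item attributes; in a real app these would come from your catalog.
ITEM_CATEGORY: Dict[int, str] = {101: "books", 102: "music", 103: "books", 104: "games"}


class CategoryRescorer:
    """Illustrative analogue of a rescorer: filter some items, re-weight others."""

    def __init__(self, allowed: Set[str], boosted: Set[str], boost: float = 1.5):
        self.allowed = allowed
        self.boosted = boosted
        self.boost = boost

    def is_filtered(self, item_id: int) -> bool:
        # Drop items whose category is not allowed; unknown items are dropped too.
        return ITEM_CATEGORY.get(item_id) not in self.allowed

    def rescore(self, item_id: int, original_score: float) -> float:
        # Multiply the score of boosted categories; leave the rest unchanged.
        if ITEM_CATEGORY.get(item_id) in self.boosted:
            return original_score * self.boost
        return original_score


def apply_rescorer(
    candidates: List[Tuple[int, float]], rescorer: CategoryRescorer, how_many: int
) -> List[Tuple[int, float]]:
    """Filter and rescore (item_id, score) pairs, then return the top how_many."""
    kept = [
        (item_id, rescorer.rescore(item_id, score))
        for item_id, score in candidates
        if not rescorer.is_filtered(item_id)
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:how_many]
```

In a distributed setting the same two steps could be applied to each user's candidate list after the recommender has produced it, which is one way to read "reimplement the same idea".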
10-22-2014
03:21 AM
How are you running it? Do any other jobs work? How about the shell? I see a possible typo in "hostnamemaster49910", as if a colon is missing. The host string is also given differently in several places.
10-22-2014
03:19 AM
No, IDRescorer has always been a part only of the non-distributed implementation.
10-15-2014
12:12 AM
You are using standalone mode, i.e. the "Spark" service and not YARN? Check that the workers are running and healthy. Did executors register at startup? Double-check that they have the memory you think they do. If they have less than you expect, they may not be accepting work because they cannot allocate the memory the job asks for.
10-15-2014
12:09 AM
TallSkinnySVD calls RowMatrix.computeSVD, which by default decides whether to run the computation locally or not. Depending on your data, the defaults may be causing the computation to run on the driver only.
10-14-2014
11:56 PM
Yes, I don't think these examples are part of the runtime platform. You would need to bring the examples with your app. This sounds like the streaming jars somehow aren't part of your distribution. Is there anything custom about your deployment of CDH? The streaming classes should be found as part of the distribution, and that's what's missing.
10-10-2014
09:06 AM
Yes, that's all correct. Set time-threshold to 24 hours (1440 minutes) to rebuild once a day, regardless of the amount of data that has been written. Yes, the amount of data used to build the model is always increasing (unless you are manually deleting or decaying data). It does sum up all counts for each user-item pair, so the data is somewhat compacted this way.
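As a rough illustration of why the input stays "somewhat compacted", here is a small Python sketch of summing counts per user-item pair. It is not the actual implementation; the function name, the event tuple layout, and the constant name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# 24 hours expressed in minutes, matching the time-threshold value mentioned above.
REBUILD_INTERVAL_MINUTES = 24 * 60


def compact(events: Iterable[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Sum the strength of all (user, item, strength) events per user-item pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for user, item, strength in events:
        totals[(user, item)] += strength
    return dict(totals)
```

Repeated interactions between the same user and item collapse into one summed entry, so the number of entries grows with the number of distinct user-item pairs rather than with the raw number of events.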
10-10-2014
01:55 AM
There is a Coursera course on Scala right now -- you can still watch the videos although it started weeks ago: https://www.coursera.org/course/progfun There are a number of examples and tutorials on the web concerning Spark. Really, take your pick after searching Google. Here's a blog post I wrote with a quick example: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
10-10-2014
01:36 AM
Scala is the native language of Spark. All else equal, it will be easiest to use Spark in Scala. However, of course, not everyone knows Scala or is using it in other projects. Using it from Java is only slightly less convenient. You will write more code since Java's handling of anonymous classes is quite verbose before Java 8. All of the Scala APIs can be called from Java too, although some look weird when accessed from Java. Most APIs have a Java-friendlier version where necessary to ease this integration. Python is probably the least easy to use since it is not JVM-based. There is a runtime overhead to translating back and forth between Spark and Python. Not all APIs are 'translated' to Python. Still, it works, and is useful if, well, you know Python and want to use it.
10-05-2014
06:21 AM
Yes, that's it. Can you delete the model while it's running? No, it would leave the instance unable to serve anything, so I hadn't seen the point of that versus just shutting down the server. No, you can't rebuild with different params without restarting, although the computation layer is the thing that rebuilds the model. The serving layer just loads what it is given. In the next version (the 2.x rewrite) it will try to find the best params automatically anyway.