Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 4990 | 03-09-2016 01:21 AM |
|  | 4255 | 03-07-2016 01:52 AM |
|  | 13366 | 02-29-2016 04:40 AM |
|  | 3968 | 02-22-2016 03:08 PM |
|  | 4962 | 01-19-2016 02:13 PM |
09-22-2015
02:36 AM
I wouldn't modify that file. Instead, include your libraries with your app or pass them via --jars, and also try setting spark.{driver,executor}.userClassPathFirst to true. Resolving these conflicts is tricky when your app uses a library that Spark also uses but does not shade, but this is the answer in most cases.
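For illustration, here's a minimal sketch of setting those properties programmatically, assuming you construct the SparkContext yourself; the app name is a placeholder, and the same settings can equally be passed as --conf flags to spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Prefer classes from user-supplied jars over Spark's bundled copies,
// on both the driver and the executors.
val conf = new SparkConf()
  .setAppName("my-app") // placeholder
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
val sc = new SparkContext(conf)
```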
09-19-2015
06:40 AM
Replication is an HDFS-level configuration. It isn't something you configure from Spark, and you don't have to worry about it from Spark. AFAIK you set a global default replication factor, but it can also be overridden per file (for example, recursively across a directory). I think you want to pursue this via HDFS.
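If you ever did need to do it from code rather than via HDFS configuration, a rough sketch using Hadoop's FileSystem API could look like the following; the path is a placeholder, and the usual routes are dfs.replication in hdfs-site.xml or the hdfs dfs -setrep command:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Set the replication factor of one existing HDFS file to 2.
// This is an HDFS-level operation; Spark itself is not involved.
val fs = FileSystem.get(new Configuration())
fs.setReplication(new Path("/data/output/part-00000"), 2.toShort)
```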
09-17-2015
02:09 AM
1 Kudo
I suppose you can cluster the term vectors in V·S for this purpose, to discover related terms and thus topics. This is the type of problem where you would more usually use LDA. I know you're using Mahout, but if you ever consider using Spark, there's a chapter on exactly this in our book: http://shop.oreilly.com/product/0636920035091.do
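To make the idea concrete, here's a rough, hedged sketch in Spark MLlib; it assumes an existing RDD[Vector] of TF-IDF document vectors called docVectors and a SparkContext sc, and the rank and cluster counts are arbitrary placeholders:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Compute a truncated SVD of the term-document matrix, then cluster
// the rows of V*S (one row per term) to find groups of related terms.
val svd = new RowMatrix(docVectors).computeSVD(k = 100, computeU = false)
val termVectors = sc.parallelize(
  (0 until svd.V.numRows).map { i =>
    Vectors.dense(Array.tabulate(svd.V.numCols)(j => svd.V(i, j) * svd.s(j)))
  }
)
val model = KMeans.train(termVectors, 20, 10) // 20 clusters, 10 iterations
```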
09-17-2015
01:30 AM
1 Kudo
The output is as you say: these are the products of the SVD. What you do with them depends on what you're trying to achieve. For example, you can look at V·S to study term similarities, or U·S to discover document similarities.
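As a small, hedged sketch of the document-similarity side in Spark MLlib (it assumes svd came from RowMatrix.computeSVD with computeU = true, so U is available as a RowMatrix; the two row indices compared are placeholders):

```scala
// One row of U*S per document: scale each row of U by the singular values.
val docVectors = svd.U.rows.map(_.toArray.zipWithIndex.map { case (v, j) => v * svd.s(j) })

// Compare two documents by cosine similarity of their U*S rows.
def cosine(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum /
    (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))

val Array(docA, docB) = docVectors.take(2)
println(cosine(docA, docB))
```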
09-14-2015
09:05 AM
1 Kudo
That would be unsupported. I think you'd find that support would still try to help in this case, but would decline to pursue it if it legitimately looked like a problem with Spark 1.4 itself. Spark 1.5 will of course be supported in CDH 5.5, coming soon.
09-13-2015
12:04 PM
In general it means the executors need more memory, but it's a fairly complex question. Maybe you need smaller tasks so that peak memory usage is lower; maybe you should cache less, or cache with a less memory-hungry storage level; or simply give the executors more memory. Better GC settings may help at the margins. Usually the place to start is deciding whether your computation is inherently going to scale badly and run out of memory in a certain stage.
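As a loose sketch of those knobs (the input path, partition count, and memory size are arbitrary placeholders, and an existing SparkContext sc is assumed):

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder input; any large RDD would do.
val someRdd = sc.textFile("hdfs:///data/input")

// More partitions means smaller tasks, so peak memory per task is lower.
val smallerTasks = someRdd.repartition(400)

// A serialized, disk-spilling storage level keeps far less on the heap
// than the default MEMORY_ONLY caching.
smallerTasks.persist(StorageLevel.MEMORY_AND_DISK_SER)

// More executor memory, and GC tuning at the margins, are set at submit time, e.g.:
//   spark-submit --executor-memory 8g \
//     --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" ...
```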
09-13-2015
12:36 AM
This much basically says "the executor stopped for some reason". You'd have to dig into the application via YARN, click through to its entry in the history server, and browse the executor logs to see if you can find any exceptions. It sounds like it stopped responding. As a guess, you might be out of memory and stuck in GC thrashing.
01-15-2015
07:53 PM
1 Kudo
Is this library bundled with your app? One guess is that it is not, and that it happens to be on the driver's classpath by way of another dependency, but isn't accidentally present on the executors too.
01-05-2015
03:22 AM
No, it is not the same 'because' computation as in the paper. The one in the paper is better, but it requires storing a k x k matrix for every user, or computing it on the fly, both of which are pretty prohibitive; they're not hard to implement, though. This one is a cheap, non-personalized computation based on item similarity. And no, the system does not serve the original data, just results from the factored model. The assumption is that, if the caller needs that information, the caller already has it; since it's generally not specific to the core recommender, accessing that data is not part of the engine.
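Purely to illustrate the shape of that cheap item-similarity approach (this is not the project's actual code; itemFactors, the item IDs, and the factor vectors are all hypothetical):

```scala
// Rank the items in a user's history by cosine similarity of their latent
// factor vectors to the recommended item's vector; the top ones are the "because".
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

def because(recommended: String,
            userHistory: Seq[String],
            itemFactors: Map[String, Array[Double]]): Seq[(String, Double)] = {
  val target = itemFactors(recommended)
  userHistory
    .flatMap(id => itemFactors.get(id).map(f => id -> cosine(target, f)))
    .sortBy(-_._2)
}
```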
12-23-2014
12:47 AM
That's fine. The machine needs to be able to communicate with the cluster, of course. Usually you would also make the Hadoop configuration visible on that machine and point to it with HADOOP_CONF_DIR. I think that will be required to get MapReduce to work.