Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3305 | 03-09-2016 01:21 AM
 | 3253 | 03-07-2016 01:52 AM
 | 10644 | 02-29-2016 04:40 AM
 | 2837 | 02-22-2016 03:08 PM
 | 3837 | 01-19-2016 02:13 PM
05-17-2016
01:41 PM
Yes, that sounds right, though I confess I haven't tried that myself. Others here may have better suggestions.
03-09-2016
01:21 AM
1 Kudo
That's not what it says; it says they just aren't supported, typically because they're not "supported" in Spark either (e.g. experimental APIs). "Not supported" doesn't mean it doesn't work; it just means you can't file a support ticket for it. CDH 5.6 = Spark 1.5 + patches, meaning it's roughly 1.5.2, likely with a slightly different set of maintenance patches. It might lack unimportant ones that arguably shouldn't be in a maintenance release, or might include a critical one that landed after 1.5.2. Generally speaking there are no other differences; it's just upstream Spark with some tinkering with versions so it integrates with other Hadoop components correctly. The exception is SparkR, which isn't shipped at all, partly because CDH can't ship R itself.
03-07-2016
01:52 AM
It includes an implementation of classification using random decision forests. Decision forests actually support both categorical and numeric features. However, for text classification, you're correct that you'd typically transform your text into numeric vectors via TF-IDF first. This is something you'd have to do separately. Yes, the dimensionality is high. Decision forests can cope with that, but they're not the most natural choice for text classification. You may see what I mean when I say Oryx is not a tool for classification but a tool for productionizing, which happens to include an implementation of a classifier. In 2.x, you also have an implementation of decision forests, and also no magic built-in TF-IDF or anything. However, the architecture is much more supportive of plugging your own Spark-based pipeline and model build into the framework; 1.x did not support this.
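For reference, a minimal sketch of the TF-IDF step described above, using Spark MLlib's RDD-based API (the input path, whitespace tokenization, and feature count are illustrative; Oryx itself doesn't do this for you):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenized documents; the path and simple whitespace split are placeholders
val docs: RDD[Seq[String]] = sc.textFile("hdfs:///data/docs.txt").map(_.split("\\s+").toSeq)

// Hash terms into a fixed-size feature space, then reweight by inverse document frequency
val hashingTF = new HashingTF(1 << 18)
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
// tfidf can then be paired with labels (LabeledPoint) and fed to a decision forest
```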
03-06-2016
11:34 AM
Yes, though I would describe Oryx as support for productionizing some kind of learning system; just building a model is something you should do with other tools whose purpose is to build models. Oryx 1 is not exactly deprecated, but Oryx 2 is the only version in active development, and I'd really encourage you to look there. The good news is that it's a lot easier in 2.x to reuse a model-building process you created in, say, Spark; in 1.x that's not possible.
02-29-2016
04:40 AM
The error points to the problem -- you may have plenty of memory overall, but not enough permgen space in the JVM. Try something like -XX:MaxPermSize=2g in the JVM options for your executors.
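A minimal sketch of how that might be passed to executors, assuming Spark 1.x and a pre-Java-8 JVM (the 2g value is just an example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Executor JVMs launch after the driver, so this setting takes effect there.
// -XX:MaxPermSize only applies to Java 7 and earlier; Java 8 replaced PermGen with Metaspace.
val conf = new SparkConf()
  .setAppName("PermGenExample")
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=2g")
val sc = new SparkContext(conf)
```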
02-22-2016
03:08 PM
"Not supported" means you can't file support tickets for it. It's shipped and works though.
02-22-2016
02:35 PM
Yes, it always has been. You can inspect the assembly JAR or just try using it in the shell -- did you try it? CDH ships all of Spark in general. The only thing I think isn't included is SparkR's R bits.
01-20-2016
02:55 AM
1 Kudo
The best resource is probably the web site at http://oryx.io, as well as the source code at http://github.com/OryxProject/oryx
01-19-2016
02:13 PM
2 Kudos
It depends a lot on what you mean by "mature". It is mature in the sense that it's:
- a fifth (!) generation architecture
- used by real customers in production, yes
- not at a beta stage, but at a 2.1.1 release now
- built on technologies that Cloudera supports as production-ready

It's not mature in the sense that it:
- is not itself formally supported by Cloudera -- only a labs project
- is still fairly new in its current 2.x form, having been finished only about 5 months ago

I think the architecture is certainly the way to go if you're building this kind of thing on Hadoop. See http://oryx.io. I don't know much about Samoa, but I understand it to be a distributed, stream-centric ML library. On the plus side, it's probably better than anything in Spark for building huge models incrementally, since that's what it focuses on. On the downside, it doesn't do the model-serving element, which Oryx tries to provide, and in a sense Samoa is a much less standard technology than Spark, HDFS, and Kafka.
12-16-2015
03:23 AM
1 Kudo
Yes, I think that begins to narrow it down. I don't know that you're going to find a big performance difference, since distributions will generally ship the upstream project with only minimal modifications to integrate it. (That said, CDH does let you enable native acceleration for some mathematical operations in Spark MLlib. I don't think other distros enable this and ship the right libraries. It's possible that could matter to your use case.)

I'd look at how recent the Spark distribution is. Cloudera ships Spark 1.5 in CDH 5.5; MapR is on 1.4 and Hortonworks on 1.3, with a beta preview of 1.5 at the moment in both cases. We're already integrating the nearly-released Spark 1.6 too.

Finally, if you're considering paying for support, I think it bears evaluating how much each vendor invests in Spark. No investment means no expertise and no real ability to fix your problems. At Cloudera, we have a full-time team on Spark, including 4 committers (including me). I think you'll find other vendors virtually non-existent in the Spark community, but go see for yourself.
12-16-2015
02:46 AM
First, you'd have to define what you're trying to "benchmark". I don't think these distributions vary in speed; they include reasonably different components around the core. That is, it's kind of like choosing a car solely by its max RPM or something, even if that's important to you.
10-09-2015
01:18 AM
One quick question -- are you running on Windows?
09-30-2015
11:46 AM
It's possible to just use a static Java Executor (thread pool) in your code and use it to run multi-threaded operations within each function call, though this may not be efficient. If your goal is simply full utilization of cores, then make sure you have enough executors, with enough cores, running to use all of your cluster. Then make sure your number of partitions is at least that large, and each operation can stay single-threaded.
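To illustrate the first option, here's a sketch of a JVM-wide thread pool used inside a partition-level operation (the ThreadPools object, doWork, the pool size, and the toy RDD are all illustrative, not anything Spark provides):

```scala
import java.util.concurrent.{Callable, Executors, ExecutorService}

// One static pool per executor JVM (size is arbitrary here)
object ThreadPools {
  lazy val pool: ExecutorService = Executors.newFixedThreadPool(4)
}

def doWork(x: Int): Int = x * x  // stand-in for the real per-record work

val result = sc.parallelize(1 to 1000, 8).mapPartitions { iter =>
  // Farm each record's work out to the shared pool, then collect the results
  val futures = iter.map { x =>
    ThreadPools.pool.submit(new Callable[Int] { override def call(): Int = doWork(x) })
  }.toList
  futures.iterator.map(_.get())
}.collect()
```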
09-30-2015
09:42 AM
You can't use RDDs inside functions applied to another RDD executing remotely, which may be what you're doing. Otherwise I'm not clear on what you're executing. I suspect you're doing something that doesn't work in general in Spark, but may happen to work when executing locally in one JVM.
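For illustration, a sketch of the anti-pattern and one way around it (the RDDs here are just placeholders):

```scala
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 10)

// Doesn't work in general: rdd2 can't be used inside a task running on an executor
// val counts = rdd1.map(x => rdd2.filter(_ == x).count())

// Instead, join the RDDs, or broadcast a small dataset collected to the driver:
val small = sc.broadcast(rdd2.collect().toSet)
val filtered = rdd1.filter(x => small.value.contains(x))
```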
09-28-2015
09:07 AM
You won't be able to read a local file with this code; you're still trying to read from the classpath. That part would also have to change in order to read a file locally.
09-27-2015
03:09 AM
Here, you're using your own build of Spark against an older version of Hive than what's in CDH. That might mostly work, but you're seeing the problems of compiling against one version and running against another. I'm afraid you're on your own if you're rolling your own build, but I expect you'd get much closer if you made a build targeting the same Hive version as CDH.
09-25-2015
09:47 AM
The relationship of .jars and classloaders may not be the same as in local mode, such that this may not work as expected. Instead of depending on this, consider either distributing your file via HDFS, or using the --files option with Spark to distribute files to local disk: http://spark.apache.org/docs/latest/running-on-yarn.html
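As a sketch of the --files route, assuming a file shipped with something like --files myconfig.properties (the file name is illustrative), each task can resolve its local copy like this:

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

val counts = sc.parallelize(1 to 4, 2).mapPartitions { iter =>
  // SparkFiles.get returns the local path of a file distributed with --files
  val localPath = SparkFiles.get("myconfig.properties")
  val lines = Source.fromFile(localPath).getLines().toList
  iter.map(i => s"partition item $i saw ${lines.size} config lines")
}.collect()
```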
09-22-2015
06:06 AM
1 Kudo
I remember some problems with snappy and HBase like this, like somehow an older version used by HBase ended up taking precedence in the app classloader and then it could not quite load properly, as it couldn't see the shared library in the parent classloader. This may be a manifestation of that one. I know there are certainly cases where there is no resolution to the conflict, since an app and Spark may use mutually incompatible versions of a dependency, and one will mess with the other if the Spark and app classloader are connected, no matter what their ordering. For this toy example, you'd just not set the classpath setting since it isn't needed. For your app, if neither combination works, then your options are probably to harmonize library versions with Spark, or shade your copy of the library.
09-22-2015
05:49 AM
Hm, but have you modified classpath.txt? IIRC, the last time I saw this it was some strange problem between the snappy from HBase and the one used by other things like Spark. Does it work without the userClassPathFirst arg? Just trying to narrow it down. Turning on this flag is always problem territory, but that's a simple example with no obvious reason it shouldn't work.
09-22-2015
04:30 AM
That's a different kind of conflict. Somehow you have a different version of snappy in your app classpath, maybe? You aren't including Spark/Hadoop in your app jar, right?

The Spark assembly only contains Hadoop jars if built that way, but in a CDH cluster that's not a good idea, since the cluster already has its own copy of the Hadoop bits. It's built as 'hadoop-provided', and the classpath then contains the Hadoop jars and dependencies, plus Spark's. Modifying this means modifying the distribution for all applications; it may or may not work with the rest of CDH, and may or may not work with other apps. These modifications aren't supported, though you can try whatever you want if you're OK with 'voiding the warranty', so to speak.

Spark classpath issues are tricky in general, not just in CDH, since Spark uses a load of libraries and doesn't shade most of them. Yes, you can try shading your own copies as a fall-back if the classpath-first args don't work. But you might need to double-check what you're trying to bring in.
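If you go the shading route and build with sbt-assembly, a sketch of a relocation rule might look like this (the Guava package is just an example of a conflicting library; substitute whatever you're actually bringing in):

```scala
// build.sbt fragment, assuming the sbt-assembly plugin
assemblyShadeRules in assembly := Seq(
  // Rewrite the library's packages into a private namespace inside your app jar
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
)
```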
09-22-2015
02:36 AM
I wouldn't modify that file. Instead, include your libraries with your app or via --jars, but also try setting spark.{driver,executor}.userClassPathFirst to true. Resolving these conflicts is tricky in Spark, when you use a library that Spark also uses and doesn't shade, but this is the answer in most cases.
09-19-2015
06:40 AM
Replication is an HDFS-level configuration. It isn't something you configure from Spark, and you don't have to worry about it from Spark. AFAIK you set a global replication factor, but you can also set it per file or directory. I think you want to pursue this via HDFS.
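For instance, a sketch using the Hadoop FileSystem API from the driver (the path and replication factor are illustrative); the hdfs dfs -setrep command does the same from the shell:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration that Spark already loaded
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.setReplication(new Path("/user/me/output/part-00000"), 2.toShort)
```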
09-17-2015
02:09 AM
1 Kudo
I suppose you can cluster term vectors in V S for this purpose, to discover related terms and thus topics. This is the type of problem where you might more usually use LDA. I know you're using Mahout, but if you ever consider using Spark, there's a chapter on exactly this in our book: http://shop.oreilly.com/product/0636920035091.do
09-17-2015
01:30 AM
1 Kudo
The output is as you say -- these are the factors of the SVD. You can do what you want with them; it depends on what you're trying to achieve. You can look at the matrix V S to study term similarities, or U S to discover document similarities, for example.
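If it helps to see the pieces concretely, here's a sketch with Spark MLlib's RowMatrix (I know the thread is about Mahout; the random vectors below are just a stand-in for your TF-IDF document vectors, and k=10 is arbitrary):

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Random stand-in for TF-IDF document vectors, one row per document
val docVectors: RDD[Vector] =
  sc.parallelize(Seq.fill(50)(Vectors.dense(Array.fill(20)(scala.util.Random.nextDouble()))))

val mat = new RowMatrix(docVectors)
val svd = mat.computeSVD(10, computeU = true)

val U = svd.U          // document-space basis (rows correspond to documents)
val s = svd.s          // singular values
val V: Matrix = svd.V  // term-space basis (rows correspond to terms)
// Rows of U scaled by s relate documents; rows of V scaled by s relate terms.
```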
09-14-2015
09:05 AM
1 Kudo
That would be unsupported. I think you'd find that support would try to help in this case, but if it legitimately looked like a problem with Spark 1.4, they would decline to pursue it. Spark 1.5 is supported in CDH 5.5, of course, which is coming soon.
09-13-2015
12:04 PM
In general it means executors need more memory, but it's a fairly complex question. Maybe you need smaller tasks so that peak memory usage is lower. Maybe cache less, or use a lower maximum cache level. Or give executors more memory. Maybe, at the margins, better GC settings. Usually the place to start is deciding whether your computation is inherently going to scale badly and run out of memory at a certain stage.
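As a sketch of the kinds of knobs mentioned above (all values are illustrative, and the toy RDD stands in for your data):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// More memory per executor (takes effect when the SparkContext is created)
val conf = new SparkConf().set("spark.executor.memory", "4g")

// Smaller tasks: more partitions means lower peak memory usage per task
val bigRdd = sc.parallelize(1 to 1000000)
val smallerTasks = bigRdd.repartition(400)

// A cheaper cache level than the default MEMORY_ONLY
smallerTasks.persist(StorageLevel.MEMORY_AND_DISK_SER)
```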
09-13-2015
12:36 AM
This basically says "the executor stopped for some reason". You'd have to dig into the application via YARN, click through to its entry in the history server, browse those logs, and see if you can find exceptions in the executor log. It sounds like it stopped responding. As a guess, you might be out of memory and stuck in GC thrashing.
01-15-2015
07:53 PM
1 Kudo
Is this library bundled with your app? One guess would be that it is not, and happens to be on the driver's classpath via another dependency, but isn't found on the executors.
01-05-2015
03:22 AM
No, it is not the same 'because' computation as in the paper. The one in the paper is better, but it requires storing a k x k matrix for every user, or computing it on the fly, both of which are pretty prohibitive. (They're not hard to implement, though.) This is a cheap-o, non-personalized computation based on item similarity. No, the system does not serve the original data, just results from the factored model. It's assumed that, if the caller needs this info, the caller has it; that data is generally not specific to the core recommender, so accessing it is not part of the engine.
12-23-2014
12:47 AM
That's fine. The machine needs to be able to communicate with the cluster, of course. Usually you would make the Hadoop configuration visible as well and point to it with HADOOP_CONF_DIR; I think that will be required to get MapReduce to work.