Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
Views | Posted |
---|---|
4990 | 03-09-2016 01:21 AM |
4255 | 03-07-2016 01:52 AM |
13366 | 02-29-2016 04:40 AM |
3968 | 02-22-2016 03:08 PM |
4962 | 01-19-2016 02:13 PM |
05-17-2016
01:41 PM
Yes, that sounds right, though I confess I haven't tried that myself. Others here may have better suggestions.
03-09-2016
01:21 AM
1 Kudo
That's not what it says; it says they just aren't supported, typically because they're not "supported" in Spark either (e.g. an experimental API). "Not supported" doesn't mean "doesn't work"; it just means you can't file a support ticket for it. CDH 5.6 = Spark 1.5 + patches, meaning it's like 1.5.2, likely with a slightly different set of maintenance patches. It might lack unimportant ones that arguably shouldn't be in a maintenance release, or might include a critical one that was created after 1.5.2. Generally speaking there are no other differences; it's just upstream Spark with some tinkering with versions to make it integrate with other Hadoop components correctly. The exception is SparkR, which isn't even shipped, partly because CDH can't ship R itself.
03-07-2016
01:52 AM
It includes an implementation of classification using random decision forests. Decision forests actually support both categorical and numeric features. However, for text classification, you're correct that you typically transform your text into numeric vectors via TF-IDF first. This is something you'd have to do separately. Yes, the dimensionality is high. Decision forests can be fine with this, but they're not the most natural choice for text classification. You may see what I mean when I say that Oryx is not a tool for classification, but a tool for productionizing, which happens to have an implementation of a classifier. 2.x likewise has an implementation of decision forests, and likewise doesn't have magic TF-IDF built in or anything. However, the architecture is much more supportive of putting your own Spark-based pipeline and model build into the framework. 1.x did not support this.
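To make the "transform your text into numeric vectors via TF-IDF" step concrete, here is a minimal sketch in plain Python. It is not Oryx or Spark code. The function names, the whitespace tokenizer, the smoothed IDF variant, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real pipelines
    # usually also strip punctuation, stop words, and so on).
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors): one dense TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (one common variant); the +1 terms keep the weight
    # positive even for a term that appears in every document.
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) + 1.0 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents.
docs = ["spark runs on hadoop", "oryx builds on spark and kafka", "r is not shipped"]
vocab, vecs = tf_idf_vectors(docs)
```

Each document becomes a vector with one dimension per vocabulary term, which is why the dimensionality gets high on real text. Vectors like these are the numeric features a decision forest, or any other classifier, would then be trained on.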
03-06-2016
11:34 AM
Yes, though I would describe Oryx as support for productionizing some kind of learning system. Just making a model is something you should do with other tools whose purpose is to build models. Oryx 1 is not exactly deprecated, but Oryx 2 is the only version in active development, and I'd really encourage you to look there. The good news is that it's a lot easier in 2.x to reuse a model building process you created in, say, Spark. In 1.x it's not possible.
02-29-2016
04:40 AM
The error points to the problem -- you have perhaps plenty of memory but not enough PermGen space in the JVM. Try something like -XX:MaxPermSize=2g in the JVM options passed to your executors.
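As a rough sketch of how that flag could reach the executors: Spark reads extra executor JVM options from the `spark.executor.extraJavaOptions` property, which can be passed with `--conf` on `spark-submit`. The helper names and the application file `my_app.py` below are hypothetical, and the flag only matters on a Java 7 or earlier JVM, where PermGen exists.

```python
def executor_permgen_conf(max_perm_size="2g"):
    """Build the Spark property that passes -XX:MaxPermSize to executor JVMs."""
    return {"spark.executor.extraJavaOptions": f"-XX:MaxPermSize={max_perm_size}"}


def spark_submit_args(app, conf):
    # Each property becomes a `--conf key=value` pair on the command line.
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# `my_app.py` is a placeholder application name.
args = spark_submit_args("my_app.py", executor_permgen_conf("2g"))
print(" ".join(args))
```

The same property can instead be set in `spark-defaults.conf` if you want it applied to every job rather than to a single submission.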
02-22-2016
03:08 PM
"Not supported" means you can't file support tickets for it. It's shipped and works though.
02-22-2016
02:35 PM
Yes, always has been. You can inspect the assembly JAR or just try using it in the shell -- did you try it? CDH ships all of Spark in general. The only thing I think isn't included is SparkR's R bits.
01-20-2016
02:55 AM
1 Kudo
The best resource is probably the web site at http://oryx.io, as well as the source code at http://github.com/OryxProject/oryx
01-19-2016
02:13 PM
2 Kudos
It depends a lot on what you mean by "mature". It is mature in the sense that it's:
- a fifth (!) generation architecture
- used by real customers in production, yes
- not at a beta stage, but at a 2.1.1 release now
- built on technologies that Cloudera supports as production ready
It's not mature in the sense that it:
- is not itself formally supported by Cloudera -- only a labs project
- is still fairly new in its current 2.x form, having finished about 5 months ago
I think the architecture is certainly the way to go if you're building this kind of thing on Hadoop. See http://oryx.io. I don't know much about Samoa but I understand it to be a distributed stream-centric ML library. On the plus side, it's probably better than anything in Spark for building huge models incrementally, as this is what it focuses on. On the downside, it doesn't do the model serving element, which Oryx tries to provide, and in a sense Samoa is a much less standard technology than Spark, HDFS, and Kafka.
12-16-2015
03:23 AM
1 Kudo
Yes, I think that begins to narrow it down. I don't know that you're going to find a big performance difference, since distributions will generally ship the upstream project with only minimal modifications to integrate it. (That said, CDH does let you enable native acceleration for some mathematical operations in Spark MLlib. I don't think other distros enable this and ship the right libraries. It's possible that could matter to your use case.) I'd look at how recent the Spark distribution is. Cloudera ships Spark 1.5 in CDH 5.5; MapR is on 1.4 and Hortonworks on 1.3, with a beta preview of 1.5 at the moment in both cases. We're already integrating the nearly-released Spark 1.6 too. Finally, if you're considering paying for support, I think it bears evaluating how much each vendor invests in Spark. No investment means no expertise and no real ability to fix your problems. At Cloudera, we have a full-time team on Spark, including 4 committers (including me). I think you'll find other vendors virtually non-existent in the Spark community, but, go see for yourself.