Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3305 | 03-09-2016 01:21 AM
 | 3253 | 03-07-2016 01:52 AM
 | 10644 | 02-29-2016 04:40 AM
 | 2837 | 02-22-2016 03:08 PM
 | 3837 | 01-19-2016 02:13 PM
05-17-2016
01:41 PM
Yes, that sounds right, though I confess I haven't tried that myself. Others here may have better suggestions.
03-09-2016
01:21 AM
1 Kudo
That's not what it says; it says they just aren't supported, typically because they're not "supported" in Spark either (e.g. experimental APIs). "Not supported" doesn't mean it doesn't work; it just means you can't file a support ticket for it. CDH 5.6 = Spark 1.5 + patches, meaning it's roughly 1.5.2, likely with a slightly different set of maintenance patches. It might lack unimportant ones that arguably shouldn't be in a maintenance release, or might include a critical one that landed after 1.5.2. Generally speaking there are no other differences; it's just upstream Spark with some tinkering with versions so it integrates with other Hadoop components correctly. The exception is SparkR, which isn't shipped at all, partly because CDH can't ship R itself.
03-07-2016
01:52 AM
It includes an implementation of classification using random decision forests. Decision forests actually support both categorical and numeric features. However, for text classification, you're correct that you'd typically transform your text into numeric vectors via TF-IDF first. This is something you'd have to do separately. Yes, the dimensionality is high. Decision forests can cope with that, but they're not the most natural choice for text classification. You may see what I mean when I say Oryx is not a tool for classification but a tool for productionizing, which happens to include an implementation of a classifier. In 2.x, you also have an implementation of decision forests, and also no magic built-in TF-IDF or anything. However, the architecture is much more supportive of plugging your own Spark-based pipeline and model build into the framework; 1.x did not support this.
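For reference, a minimal sketch of the TF-IDF step described above, using Spark MLlib's RDD-based API (the input path, whitespace tokenization, and feature count are illustrative; Oryx itself doesn't do this for you):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenized documents; the path and simple whitespace split are placeholders
val docs: RDD[Seq[String]] = sc.textFile("hdfs:///data/docs.txt").map(_.split("\\s+").toSeq)

// Hash terms into a fixed-size feature space, then reweight by inverse document frequency
val hashingTF = new HashingTF(1 << 18)
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
// tfidf can then be paired with labels (LabeledPoint) and fed to a decision forest
```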
03-06-2016
11:34 AM
Yes, though I would describe Oryx as support for productionizing some kind of learning system; just building a model is something you should do with other tools whose purpose is to build models. Oryx 1 is not exactly deprecated, but Oryx 2 is the only version in active development, and I'd really encourage you to look there. The good news is that it's a lot easier in 2.x to reuse a model-building process you created in, say, Spark; in 1.x that's not possible.
02-29-2016
04:40 AM
The error points to the problem -- you may have plenty of memory overall, but not enough permgen space in the JVM. Try something like -XX:MaxPermSize=2g in the JVM options for your executors.
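A minimal sketch of how that might be passed to executors, assuming Spark 1.x and a pre-Java-8 JVM (the 2g value is just an example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Executor JVMs launch after the driver, so this setting takes effect there.
// -XX:MaxPermSize only applies to Java 7 and earlier; Java 8 replaced PermGen with Metaspace.
val conf = new SparkConf()
  .setAppName("PermGenExample")
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=2g")
val sc = new SparkContext(conf)
```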
02-22-2016
03:08 PM
"Not supported" means you can't file support tickets for it. It's shipped and works though.
02-22-2016
02:35 PM
Yes, it always has been. You can inspect the assembly JAR or just try using it in the shell -- did you try it? CDH ships all of Spark in general. The only thing I think isn't included is SparkR's R bits.
01-20-2016
02:55 AM
1 Kudo
The best resource is probably the web site at http://oryx.io, as well as the source code at http://github.com/OryxProject/oryx
01-19-2016
02:13 PM
2 Kudos
It depends a lot on what you mean by "mature". It is mature in the sense that it's:
- a fifth (!) generation architecture
- used by real customers in production, yes
- not at a beta stage, but at a 2.1.1 release now
- built on technologies that Cloudera supports as production-ready

It's not mature in the sense that it:
- is not itself formally supported by Cloudera -- only a labs project
- is still fairly new in its current 2.x form, having been finished only about 5 months ago

I think the architecture is certainly the way to go if you're building this kind of thing on Hadoop. See http://oryx.io. I don't know much about Samoa, but I understand it to be a distributed, stream-centric ML library. On the plus side, it's probably better than anything in Spark for building huge models incrementally, since that's what it focuses on. On the downside, it doesn't do the model-serving element, which Oryx tries to provide, and in a sense Samoa is a much less standard technology than Spark, HDFS, and Kafka.
12-16-2015
03:23 AM
1 Kudo
Yes, I think that begins to narrow it down. I don't know that you're going to find a big performance difference, since distributions will generally ship the upstream project with only minimal modifications to integrate it. (That said, CDH does let you enable native acceleration for some mathematical operations in Spark MLlib. I don't think other distros enable this and ship the right libraries. It's possible that could matter to your use case.)

I'd look at how recent the Spark distribution is. Cloudera ships Spark 1.5 in CDH 5.5; MapR is on 1.4 and Hortonworks on 1.3, with a beta preview of 1.5 at the moment in both cases. We're already integrating the nearly-released Spark 1.6 too.

Finally, if you're considering paying for support, I think it bears evaluating how much each vendor invests in Spark. No investment means no expertise and no real ability to fix your problems. At Cloudera, we have a full-time team on Spark, including 4 committers (including me). I think you'll find other vendors virtually non-existent in the Spark community, but go see for yourself.
12-16-2015
02:46 AM
First, you'd have to define what you're trying to "benchmark". I don't think these distributions vary in speed; they include reasonably different components around the core. That is, it's kind of like choosing a car solely by its max RPM or something, even if that's important to you.
10-09-2015
01:18 AM
One quick question -- are you running on Windows?
09-30-2015
11:46 AM
It's possible to just use a static Java Executor (thread pool) in your code and use it to run multi-threaded operations within each function call, though this may not be efficient. If your goal is simply full utilization of cores, then make sure you have enough executors, with enough cores, running to use all of your cluster. Then make sure your number of partitions is at least that large, and each operation can stay single-threaded.
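To illustrate the first option, here's a sketch of a JVM-wide thread pool used inside a partition-level operation (the ThreadPools object, doWork, the pool size, and the toy RDD are all illustrative, not anything Spark provides):

```scala
import java.util.concurrent.{Callable, Executors, ExecutorService}

// One static pool per executor JVM (size is arbitrary here)
object ThreadPools {
  lazy val pool: ExecutorService = Executors.newFixedThreadPool(4)
}

def doWork(x: Int): Int = x * x  // stand-in for the real per-record work

val result = sc.parallelize(1 to 1000, 8).mapPartitions { iter =>
  // Farm each record's work out to the shared pool, then collect the results
  val futures = iter.map { x =>
    ThreadPools.pool.submit(new Callable[Int] { override def call(): Int = doWork(x) })
  }.toList
  futures.iterator.map(_.get())
}.collect()
```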
09-30-2015
09:42 AM
You can't use RDDs inside functions applied to another RDD executing remotely, which may be what you're doing. Otherwise I'm not clear on what you're executing. I suspect you're doing something that doesn't work in general in Spark, but may happen to work when executing locally in one JVM.
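For illustration, a sketch of the anti-pattern and one way around it (the RDDs here are just placeholders):

```scala
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 10)

// Doesn't work in general: rdd2 can't be used inside a task running on an executor
// val counts = rdd1.map(x => rdd2.filter(_ == x).count())

// Instead, join the RDDs, or broadcast a small dataset collected to the driver:
val small = sc.broadcast(rdd2.collect().toSet)
val filtered = rdd1.filter(x => small.value.contains(x))
```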
09-28-2015
09:07 AM
You won't be able to read a local file with this code; you're still trying to read from the classpath. That part would also have to change in order to read a file locally.
09-27-2015
03:09 AM
Here, you're using your own build of Spark against an older version of Hive than what's in CDH. That might mostly work, but you're seeing the problems of compiling against one version and running against another. I'm afraid you're on your own if you're rolling your own build, but I expect you'd get much closer if you made a build targeting the same Hive version as CDH.
09-25-2015
09:47 AM
The relationship of .jars and classloaders may not be the same as in local mode, such that this may not work as expected. Instead of depending on this, consider either distributing your file via HDFS, or using the --files option with Spark to distribute files to local disk: http://spark.apache.org/docs/latest/running-on-yarn.html
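As a sketch of the --files route, assuming a file shipped with something like --files myconfig.properties (the file name is illustrative), each task can resolve its local copy like this:

```scala
import org.apache.spark.SparkFiles
import scala.io.Source

val counts = sc.parallelize(1 to 4, 2).mapPartitions { iter =>
  // SparkFiles.get returns the local path of a file distributed with --files
  val localPath = SparkFiles.get("myconfig.properties")
  val lines = Source.fromFile(localPath).getLines().toList
  iter.map(i => s"partition item $i saw ${lines.size} config lines")
}.collect()
```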
09-22-2015
06:06 AM
1 Kudo
I remember some problems with snappy and HBase like this, like somehow an older version used by HBase ended up taking precedence in the app classloader and then it could not quite load properly, as it couldn't see the shared library in the parent classloader. This may be a manifestation of that one. I know there are certainly cases where there is no resolution to the conflict, since an app and Spark may use mutually incompatible versions of a dependency, and one will mess with the other if the Spark and app classloader are connected, no matter what their ordering. For this toy example, you'd just not set the classpath setting since it isn't needed. For your app, if neither combination works, then your options are probably to harmonize library versions with Spark, or shade your copy of the library.
09-22-2015
05:49 AM
Hm, but have you modified classpath.txt? IIRC, the last time I saw this it was some strange problem between the snappy from HBase and the one used by other things like Spark. Does it work without the userClassPathFirst arg? Just trying to narrow it down. Turning on this flag is always problem territory, but that's a simple example with no obvious reason it shouldn't work.
09-22-2015
04:30 AM
That's a different kind of conflict. Somehow you have a different version of snappy in your app classpath, maybe? You aren't including Spark/Hadoop in your app jar, right?

The Spark assembly only contains Hadoop jars if built that way, but in a CDH cluster that's not a good idea, since the cluster already has its own copy of the Hadoop bits. It's built as 'hadoop-provided', and the classpath then contains the Hadoop jars and dependencies, plus Spark's. Modifying this means modifying the distribution for all applications; it may or may not work with the rest of CDH, and may or may not work with other apps. These modifications aren't supported, though you can try whatever you want if you're OK with 'voiding the warranty', so to speak.

Spark classpath issues are tricky in general, not just in CDH, since Spark uses a load of libraries and doesn't shade most of them. Yes, you can try shading your own copies as a fall-back if the classpath-first args don't work. But you might need to double-check what you're trying to bring in.
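If you go the shading route and build with sbt-assembly, a sketch of a relocation rule might look like this (the Guava package is just an example of a conflicting library; substitute whatever you're actually bringing in):

```scala
// build.sbt fragment, assuming the sbt-assembly plugin
assemblyShadeRules in assembly := Seq(
  // Rewrite the library's packages into a private namespace inside your app jar
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
)
```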
09-22-2015
02:36 AM
I wouldn't modify that file. Instead, include your libraries with your app or via --jars, but also try setting spark.{driver,executor}.userClassPathFirst to true. Resolving these conflicts is tricky in Spark, when you use a library that Spark also uses and doesn't shade, but this is the answer in most cases.
09-19-2015
06:40 AM
Replication is an HDFS-level configuration. It isn't something you configure from Spark, and you don't have to worry about it from Spark. AFAIK you set a global replication factor, but you can also set it per file or directory. I think you want to pursue this via HDFS.
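For instance, a sketch using the Hadoop FileSystem API from the driver (the path and replication factor are illustrative); the hdfs dfs -setrep command does the same from the shell:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration that Spark already loaded
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.setReplication(new Path("/user/me/output/part-00000"), 2.toShort)
```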
09-17-2015
02:09 AM
1 Kudo
I suppose you can cluster term vectors in V S for this purpose, to discover related terms and thus topics. This is the type of problem where you might more usually use LDA. I know you're using Mahout, but if you ever consider using Spark, there's a chapter on exactly this in our book: http://shop.oreilly.com/product/0636920035091.do
09-17-2015
01:30 AM
1 Kudo
The output is as you say -- these are the factors of the SVD. You can do what you want with them; it depends on what you're trying to achieve. You can look at the matrix V S to study term similarities, or U S to discover document similarities, for example.
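If it helps to see the pieces concretely, here's a sketch with Spark MLlib's RowMatrix (I know the thread is about Mahout; the random vectors below are just a stand-in for your TF-IDF document vectors, and k=10 is arbitrary):

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Random stand-in for TF-IDF document vectors, one row per document
val docVectors: RDD[Vector] =
  sc.parallelize(Seq.fill(50)(Vectors.dense(Array.fill(20)(scala.util.Random.nextDouble()))))

val mat = new RowMatrix(docVectors)
val svd = mat.computeSVD(10, computeU = true)

val U = svd.U          // document-space basis (rows correspond to documents)
val s = svd.s          // singular values
val V: Matrix = svd.V  // term-space basis (rows correspond to terms)
// Rows of U scaled by s relate documents; rows of V scaled by s relate terms.
```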
09-14-2015
09:05 AM
1 Kudo
That would be unsupported. I think you'd find that support would try to help in this case, but if it legitimately looked like a problem with Spark 1.4, they would decline to pursue it. Spark 1.5 is supported in CDH 5.5, of course, which is coming soon.
09-13-2015
12:04 PM
In general it means executors need more memory, but it's a fairly complex question. Maybe you need smaller tasks so that peak memory usage is lower. Maybe cache less, or use a lower maximum cache level. Or give executors more memory. Maybe, at the margins, better GC settings. Usually the place to start is deciding whether your computation is inherently going to scale badly and run out of memory at a certain stage.
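As a sketch of the kinds of knobs mentioned above (all values are illustrative, and the toy RDD stands in for your data):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// More memory per executor (takes effect when the SparkContext is created)
val conf = new SparkConf().set("spark.executor.memory", "4g")

// Smaller tasks: more partitions means lower peak memory usage per task
val bigRdd = sc.parallelize(1 to 1000000)
val smallerTasks = bigRdd.repartition(400)

// A cheaper cache level than the default MEMORY_ONLY
smallerTasks.persist(StorageLevel.MEMORY_AND_DISK_SER)
```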
09-13-2015
12:36 AM
This basically says "the executor stopped for some reason". You'd have to dig into the application via YARN, click through to its entry in the history server, browse those logs, and see if you can find exceptions in the executor log. It sounds like it stopped responding. As a guess, you might be out of memory and stuck in GC thrashing.
01-15-2015
07:53 PM
1 Kudo
Is this library bundled with your app? One guess would be that it is not, and happens to be on the driver's classpath via another dependency, but isn't found on the executors.
01-05-2015
03:22 AM
No, it is not the same 'because' computation as in the paper. The one in the paper is better, but it requires storing a k x k matrix for every user, or computing it on the fly, both of which are pretty prohibitive. (They're not hard to implement, though.) This is a cheap-o, non-personalized computation based on item similarity. No, the system does not serve the original data, just results from the factored model. It's assumed that, if the caller needs this info, the caller has it; that data is generally not specific to the core recommender, so accessing it is not part of the engine.
12-23-2014
12:47 AM
That's fine. The machine needs to be able to communicate with the cluster, of course. Usually you would make the Hadoop configuration visible as well and point to it with HADOOP_CONF_DIR; I think that will be required to get MapReduce to work.