Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3437 | 01-26-2018 04:02 AM
 | 7076 | 12-22-2017 09:18 AM
 | 3532 | 12-05-2017 06:13 AM
 | 3847 | 10-16-2017 07:55 AM
 | 11175 | 10-04-2017 08:08 PM
06-22-2015
01:37 AM
Sure guys, let me know if it seems to work. Once this is resolved I am going to cut a 1.1.0 release.
06-14-2015
12:45 PM
1 Kudo
This concerns version 1.x by the way. The config elements in question are here: https://github.com/cloudera/oryx/blob/master/common/src/main/resources/reference.conf#L136
06-02-2015
08:34 AM
Yes, the number of splits, and therefore Mapper tasks, is determined by Hadoop MapReduce and is not altered or overridden here. 11 is the default number of Reducer tasks, which you can change. (For various reasons a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots. That is determined by MapReduce and defaults to 1 per machine, but can be raised if you know the machine can handle more. This is all just Hadoop machinery, not specific to this app.
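The "prime number of reducers" advice can be sketched with a toy helper (the function is my own illustration, not part of the app): under the default hash partitioner, keys land in reducer `hash(key) % numReducers`, so a prime count near the cluster's total reducer capacity tends to spread keys evenly.

```python
def is_prime(n):
    """True if n is a prime number."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def pick_num_reducers(total_reducer_slots):
    """Largest prime <= the cluster's total reducer slots.

    A prime count helps the default hash partitioner
    (hash(key) % numReducers) spread keys evenly.
    """
    n = total_reducer_slots
    while n >= 2 and not is_prime(n):
        n -= 1
    return max(n, 1)
```

For example, a cluster with 12 reducer slots would get 11 reducers, matching the default mentioned above.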
05-29-2015
12:50 AM
Yes, that's a good reason, if you have to scale past one machine. Previously I thought you meant you were running an entire Hadoop cluster on one machine, which is fine for a test but much slower and more complex than a simple non-Hadoop one-machine setup.

The mapper and reducer will need more memory if you see them running out of memory. If memory is very low but not exhausted, a Java process slows down in excessive GC. Otherwise, more memory does not help.

More nodes do not necessarily help either. You still face the overhead of task scheduling and data transfer, plus the time taken by non-distributed work. In fact, if your workers are not on the same nodes as the data nodes, it will be a lot slower. For your scale, which fits in one machine easily, 7 nodes is big overkill, and 60 is far too big to provide any advantage. You're measuring pure Hadoop overhead, which you can tune, but it does not reflect work done.

The upshot is that you should be able to handle data sizes hundreds or thousands of times larger this way in roughly the same amount of time. For small data sets, you can see why there is no value in using a large cluster; the data is just too tiny to split up.
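The scaling argument above can be made concrete with a toy cost model (all constants and the function itself are invented for illustration): per-node scheduling overhead and fixed non-distributed work grow or stay constant with cluster size, while only the real work shrinks.

```python
def job_time(records, nodes,
             per_record_cost=1e-5,   # seconds of real work per record
             per_task_overhead=5.0,  # scheduling/startup seconds per node
             fixed_overhead=30.0):   # non-distributed setup work, seconds
    """Toy model: distributed work shrinks with nodes; overhead doesn't."""
    return (fixed_overhead
            + nodes * per_task_overhead
            + (records * per_record_cost) / nodes)
```

With these made-up numbers, a dataset worth about one second of real work (100,000 records) runs fastest on a single node, while a dataset ten thousand times larger is where dozens of nodes start to pay off.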
05-26-2015
06:11 PM
In CM & CDH 5.4 you should unset it and let it use the one that is there on the nodes. Much faster. Wilfred
05-25-2015
11:02 AM
I don't think maintenance releases get released as such with CDH for any component, since the release cycle and customer demand for maintenance releases are different from upstream's. Important fixes are backported though, so you already have some of 1.3.1 and beyond in the 1.3.x branch in CDH. The changes aren't different; they come from upstream. Minor CDH releases rebase on upstream minor releases and so 'sync' at that point (i.e., CDH 5.5 should have the latest minor release, whether that's 1.4.x or 1.5.x).
05-22-2015
12:23 AM
1 Kudo
ALS: yes, fold-in just as before.
k-means: assign the point to a cluster and update that cluster's centroid (but don't reassign any other points).
RDF: assign the point to a leaf and update the leaf's prediction (but don't change the rest of the tree).
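A minimal sketch of the k-means case (the function and the running-mean update are my own illustration of the idea, not the project's actual code): the new point moves only its nearest centroid, and only by a shrinking fraction as the cluster grows.

```python
def fold_in_kmeans(point, centroids, counts):
    """Assign one new point to its nearest cluster and update only that
    cluster's centroid as a running mean; no other point is reassigned."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    counts[best] += 1
    n = counts[best]
    # Running mean: c_new = c_old + (x - c_old) / n
    centroids[best] = [c + (x - c) / n for x, c in zip(point, centroids[best])]
    return best
```

Folding a point into a cluster of n points this way pulls the centroid 1/(n+1) of the way toward it, so the model drifts with new data without a full re-clustering.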
05-13-2015
03:19 PM
Basically, I have to script these steps via a CM API Python script. To add the History Server:

1. Go to the Spark service.
2. Click the Instances tab.
3. Click the Add Role Instances button.
4. Select a host in the column under History Server, then click OK.
5. Click Continue.
6. Check the checkbox next to the History Server role.
7. Select Actions for Selected > Start and click Start.
8. Click Close when the action completes.
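A rough sketch of those steps with the cm-api Python client. The host, cluster, service, and role names here are placeholders, and the method signatures and the SPARK_YARN_HISTORY_SERVER role type are assumptions from memory of the cm-api client, so check them against the API docs before relying on this:

```python
def add_spark_history_server(cm_host, user, password,
                             cluster_name, service_name, role_host_id):
    """Sketch: add and start a Spark History Server role via the
    Cloudera Manager Python API (the cm-api package)."""
    # Imported inside the function so the sketch can be defined
    # even where the cm-api package is not installed.
    from cm_api.api_client import ApiResource

    api = ApiResource(cm_host, username=user, password=password)
    spark = api.get_cluster(cluster_name).get_service(service_name)
    # Equivalent of "Add Role Instances" + picking a host
    # (role name and role type are assumptions):
    spark.create_role("spark-history-server",
                      "SPARK_YARN_HISTORY_SERVER", role_host_id)
    # Equivalent of "Actions for Selected > Start"; start_roles should
    # return commands to wait on.
    for cmd in spark.start_roles("spark-history-server"):
        cmd.wait()
```

This collapses the eight UI steps into two API calls: creating the role instance on a host, then starting it.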
04-24-2015
01:34 PM
I'm also seeing this error. Strangely, I am including the jar in the spark-submit command:

/usr/bin/spark-submit --class com.mycompany.myproduct.spark.sparkhive.Hive2RddTest --master spark://mycluster:7077 --executor-memory 8G --jars hive-common-0.13.1-cdh5.3.1.jar sparkhive.jar "/home/stunos/hive.json" &

Is this insufficient to add it to the classpath? It has worked for other dependencies, so presumably Spark copies the dependencies to the other nodes. I am puzzled by this exception. I could try adding this jar to /opt/cloudera/parcels/CDH/spark/lib on each node, but at this point that is only a voodoo guess, since by my logic the command-line argument should have been sufficient. What do you think? Does this mean I probably have to build Spark?