Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3437 | 01-26-2018 04:02 AM
 | 7076 | 12-22-2017 09:18 AM
 | 3532 | 12-05-2017 06:13 AM
 | 3847 | 10-16-2017 07:55 AM
 | 11175 | 10-04-2017 08:08 PM
06-22-2015
01:37 AM
Sure guys, let me know if it seems to work. Once this is resolved I am going to cut a 1.1.0 release.
06-14-2015
12:45 PM
1 Kudo
This concerns version 1.x by the way. The config elements in question are here: https://github.com/cloudera/oryx/blob/master/common/src/main/resources/reference.conf#L136
06-02-2015
08:34 AM
Yes, the number of splits, and therefore Mapper tasks, is determined by Hadoop MapReduce and is not altered or overridden here. 11 is the default number of Reducer tasks, which you can change. (For various reasons a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots. That is determined by MapReduce and defaults to 1 per machine, but can be raised if you know the machine can handle more. This is all just Hadoop machinery, not specific to this app.
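The "prime number of reducers" advice can be sketched with a toy helper (the function is my own illustration, not part of the app): under the default hash partitioner, keys land in reducer `hash(key) % numReducers`, so a prime count near the cluster's total reducer capacity tends to spread keys evenly.

```python
def is_prime(n):
    """True if n is a prime number."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def pick_num_reducers(total_reducer_slots):
    """Largest prime <= the cluster's total reducer slots.

    A prime count helps the default hash partitioner
    (hash(key) % numReducers) spread keys evenly.
    """
    n = total_reducer_slots
    while n >= 2 and not is_prime(n):
        n -= 1
    return max(n, 1)
```

For example, a cluster with 12 reducer slots would get 11 reducers, matching the default mentioned above.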
05-29-2015
12:50 AM
Yes, that's a good reason, if you have to scale past one machine. Previously I thought you meant you were running an entire Hadoop cluster on one machine, which is fine for a test but much slower and more complex than a simple non-Hadoop one-machine setup.

The mapper and reducer will need more memory if you see them running out of memory. If memory is very low but not exhausted, a Java process slows down in excessive GC. Otherwise, more memory does not help.

More nodes do not necessarily help either. You still face the overhead of task scheduling and data transfer, plus the time taken by non-distributed work. In fact, if your workers are not on the same nodes as the data nodes, it will be a lot slower. For your scale, which fits in one machine easily, 7 nodes is big overkill, and 60 is far too big to provide any advantage. You're measuring pure Hadoop overhead, which you can tune, but it does not reflect work done.

The upshot is that you should be able to handle data sizes hundreds or thousands of times larger this way in roughly the same amount of time. For small data sets, you can see why there is no value in using a large cluster; the data is just too tiny to split up.
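The scaling argument above can be made concrete with a toy cost model (all constants and the function itself are invented for illustration): per-node scheduling overhead and fixed non-distributed work grow or stay constant with cluster size, while only the real work shrinks.

```python
def job_time(records, nodes,
             per_record_cost=1e-5,   # seconds of real work per record
             per_task_overhead=5.0,  # scheduling/startup seconds per node
             fixed_overhead=30.0):   # non-distributed setup work, seconds
    """Toy model: distributed work shrinks with nodes; overhead doesn't."""
    return (fixed_overhead
            + nodes * per_task_overhead
            + (records * per_record_cost) / nodes)
```

With these made-up numbers, a dataset worth about one second of real work (100,000 records) runs fastest on a single node, while a dataset ten thousand times larger is where dozens of nodes start to pay off.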
05-26-2015
06:11 PM
In CM & CDH 5.4 you should unset it and let it use the one that is there on the nodes. Much faster. Wilfred
05-25-2015
11:02 AM
I don't think maintenance releases get released as such with CDH for any component, since the release cycle and customer demand for maintenance releases are different from upstream's. Important fixes are backported though, so you already have some of 1.3.1 and beyond in the 1.3.x branch in CDH. The changes aren't different; they come from upstream. Minor CDH releases rebase on upstream minor releases and so 'sync' at that point (i.e., CDH 5.5 should have the latest minor release, whether that's 1.4.x or 1.5.x).
05-22-2015
12:23 AM
1 Kudo
ALS: yes, fold-in just as before.
k-means: assign the point to a cluster and update that cluster's centroid (but don't reassign any other points).
RDF: assign the point to a leaf and update the leaf's prediction (but don't change the rest of the tree).
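A minimal sketch of the k-means case (the function and the running-mean update are my own illustration of the idea, not the project's actual code): the new point moves only its nearest centroid, and only by a shrinking fraction as the cluster grows.

```python
def fold_in_kmeans(point, centroids, counts):
    """Assign one new point to its nearest cluster and update only that
    cluster's centroid as a running mean; no other point is reassigned."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    counts[best] += 1
    n = counts[best]
    # Running mean: c_new = c_old + (x - c_old) / n
    centroids[best] = [c + (x - c) / n for x, c in zip(point, centroids[best])]
    return best
```

Folding a point into a cluster of n points this way pulls the centroid 1/(n+1) of the way toward it, so the model drifts with new data without a full re-clustering.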
05-13-2015
03:19 PM
Basically, I have to script these steps via a CM API Python script. To add the History Server:

1. Go to the Spark service.
2. Click the Instances tab.
3. Click the Add Role Instances button.
4. Select a host in the column under History Server, then click OK.
5. Click Continue.
6. Check the checkbox next to the History Server role.
7. Select Actions for Selected > Start and click Start.
8. Click Close when the action completes.
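A rough sketch of those steps with the cm-api Python client. The host, cluster, service, and role names here are placeholders, and the method signatures and the SPARK_YARN_HISTORY_SERVER role type are assumptions from memory of the cm-api client, so check them against the API docs before relying on this:

```python
def add_spark_history_server(cm_host, user, password,
                             cluster_name, service_name, role_host_id):
    """Sketch: add and start a Spark History Server role via the
    Cloudera Manager Python API (the cm-api package)."""
    # Imported inside the function so the sketch can be defined
    # even where the cm-api package is not installed.
    from cm_api.api_client import ApiResource

    api = ApiResource(cm_host, username=user, password=password)
    spark = api.get_cluster(cluster_name).get_service(service_name)
    # Equivalent of "Add Role Instances" + picking a host
    # (role name and role type are assumptions):
    spark.create_role("spark-history-server",
                      "SPARK_YARN_HISTORY_SERVER", role_host_id)
    # Equivalent of "Actions for Selected > Start"; start_roles should
    # return commands to wait on.
    for cmd in spark.start_roles("spark-history-server"):
        cmd.wait()
```

This collapses the eight UI steps into two API calls: creating the role instance on a host, then starting it.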
04-24-2015
01:34 PM
I'm also seeing this error. Strangely, I am including the jar in the spark-submit command:

/usr/bin/spark-submit --class com.mycompany.myproduct.spark.sparkhive.Hive2RddTest --master spark://mycluster:7077 --executor-memory 8G --jars hive-common-0.13.1-cdh5.3.1.jar sparkhive.jar "/home/stunos/hive.json" &

Is this insufficient to add it to the classpath? It has worked for other dependencies, so presumably Spark copies the dependencies to the other nodes. I am puzzled by this exception. I could try adding this jar to /opt/cloudera/parcels/CDH/spark/lib on each node, but at this point that is only a voodoo guess, since by my logic the command-line argument should have been sufficient. What do you think? Does this mean I probably have to build Spark?