Member since 07-29-2013
366 Posts
69 Kudos Received
71 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5134 | 03-09-2016 01:21 AM |
| | 4328 | 03-07-2016 01:52 AM |
| | 13640 | 02-29-2016 04:40 AM |
| | 4053 | 02-22-2016 03:08 PM |
| | 5063 | 01-19-2016 02:13 PM |
02-22-2016
03:08 PM
"Not supported" means you can't file support tickets for it. It's shipped and works though.
01-20-2016
02:55 AM
1 Kudo
The best resource is probably the web site at http://oryx.io, as well as the source code at http://github.com/OryxProject/oryx
10-09-2015
01:27 AM
I found the solution. In my Spark Streaming application I had set SparkConf.setMaster("local[*]"), while in spark-submit I was passing --master yarn-cluster. The two master settings conflicted, so the application stayed in the ACCEPTED state and then exited.
09-30-2015
11:46 AM
It's possible to just use a static Executor in your code and use it to run multi-threaded operations within each function call. This may not be efficient though. If your goal is simply full utilization of cores, then make sure you have enough executors with enough cores running to use all of your cluster. Then make sure your number of partitions is at least as large as that total number of cores. Then each operation can be single-threaded.
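A minimal sketch of the static Executor idea, in plain Java with no Spark dependency. The class name, pool size, and the expensive() function are hypothetical; processAll() stands in for whatever you would call from inside a map or mapPartitions function. Because the pool is static, there is one per JVM (so one per executor), shared by every task running in it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StaticExecutorSketch {
    // One pool per JVM, created on first use of the class and shared by all calls.
    // Daemon threads, so the pool never keeps the JVM alive on its own.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "sketch-worker");
        t.setDaemon(true);
        return t;
    });

    // Hypothetical per-record work.
    static long expensive(int x) {
        return (long) x * x;
    }

    // Fans the inputs out over the shared pool and collects results in input order.
    static List<Long> processAll(List<Integer> inputs) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int x : inputs) {
            tasks.add(() -> expensive(x));
        }
        try {
            List<Long> results = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<Long> f : POOL.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of(1, 2, 3, 4)));
    }
}
```

Note that these threads compete with the cores Spark already assigned to other tasks on the same executor, which is one reason this may not be efficient.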
09-28-2015
08:48 PM
With the --files option, the file is placed in the working directory of each executor. You are trying to point to the file using an absolute path, which is not what the --files option does for you. Can you use just the name "rule2.xml" and not a path? When you read the documentation for --files, see the important note at the bottom of the Running on YARN page. Also, do not use Resources.getResource(); just open the file with a plain Java construct like new FileInputStream("rule2.xml") or something like it. Wilfred
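A small sketch of opening the file by bare name, in plain Java. The class and method names are hypothetical. The demo() method only simulates what --files does by writing a sample rule2.xml into the current working directory first; on a real executor the file would already be there, and only readByName("rule2.xml") would be needed.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkingDirFileSketch {
    // Opens the file by bare name; a relative name is resolved against the
    // process's current working directory.
    static String readByName(String fileName) {
        try (InputStream in = new FileInputStream(fileName)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Local stand-in for --files: drop a sample file into the working
    // directory, read it back by name, then clean up.
    static String demo() {
        Path sample = Paths.get("rule2.xml");
        try {
            Files.write(sample, "<rules/>".getBytes(StandardCharsets.UTF_8));
            return readByName("rule2.xml");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                Files.deleteIfExists(sample);
            } catch (IOException ignored) {
                // Best-effort cleanup of the sample file.
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```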
09-27-2015
08:11 AM
Done! I reinstalled a newer version of Hive, Hive 1.2.1, and the job now runs well!
09-19-2015
06:40 AM
Replication is an HDFS-level configuration. It isn't something you configure from Spark, and you don't have to worry about it from Spark. AFAIK you set a global replication factor, but can also set it per directory. I think you want to pursue this via HDFS.
09-17-2015
02:09 AM
1 Kudo
I suppose you can cluster term vectors in V S for this purpose, to discover related terms and thus topics. This is the type of problem where you might more usually use LDA. I know you're using Mahout, but if you ever consider using Spark, there's a chapter on exactly this in our book: http://shop.oreilly.com/product/0636920035091.do
09-14-2015
09:53 AM
In your Spark UI do you see it working with a large number of partitions (large number of tasks)? It could be that you are loading all 70G into memory at once if you have a small number of partitions. Also it could be that you have one huge partition with 99% of the data and lots of small ones. Then when Spark processes your huge partition it will load it all into memory. This can happen if you are mapping to a tuple e.g. (x, y) and the key (x) is the same for 99% of the data. Have a look at your Spark UI to see the size of the tasks you are running. It's likely that you will see a small number of tasks, or one huge task and a lot of small ones.
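To illustrate the skew case, here is a rough simulation in plain Java, not Spark code. It assigns keys to partitions by a non-negative hashCode modulo the partition count, the same rule a hash partitioner applies, and counts records per partition. The class name, the key names ("hot", "k0", ...), and the 990/10 split are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeySkewSketch {
    // Non-negative (hashCode mod numPartitions): every record with the same
    // key lands in the same partition.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many records each partition would receive.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String k : keys) {
            sizes[partitionFor(k, numPartitions)]++;
        }
        return sizes;
    }

    // Builds a key list where `hot` records share one key and `other`
    // records each have a distinct key.
    static List<String> skewedKeys(int hot, int other) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < hot; i++) {
            keys.add("hot");
        }
        for (int i = 0; i < other; i++) {
            keys.add("k" + i);
        }
        return keys;
    }

    static int largest(int[] sizes) {
        return Arrays.stream(sizes).max().orElse(0);
    }

    public static void main(String[] args) {
        int[] sizes = partitionSizes(skewedKeys(990, 10), 8);
        System.out.println("records per partition: " + Arrays.toString(sizes));
        System.out.println("largest partition: " + largest(sizes));
    }
}
```

With 99% of the records sharing one key, one partition ends up holding at least 990 of the 1000 records no matter how many partitions you ask for, which is the "one huge task and a lot of small ones" pattern you would see in the UI.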