Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4976 | 03-09-2016 01:21 AM |
| | 4245 | 03-07-2016 01:52 AM |
| | 13347 | 02-29-2016 04:40 AM |
| | 3966 | 02-22-2016 03:08 PM |
| | 4948 | 01-19-2016 02:13 PM |
02-22-2016
03:08 PM
"Not supported" means you can't file support tickets for it. It's shipped and works though.
01-20-2016
02:55 AM
1 Kudo
The best resources are probably the website at http://oryx.io and the source code at http://github.com/OryxProject/oryx
10-09-2015
01:27 AM
I found the solution. In my Spark Streaming application I had set SparkConf.setMaster("local[*]"), while in spark-submit I was providing --master yarn-cluster. The two master settings conflicted, so the application stayed in the ACCEPTED state and then exited.
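For anyone hitting the same thing, a minimal sketch of the fix (the app name and batch interval here are made up): leave the master unset in code so the --master flag passed to spark-submit takes effect.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// No setMaster() here: hard-coding "local[*]" in code conflicts with
// --master yarn-cluster on the command line and leaves the app in ACCEPTED.
val conf = new SparkConf().setAppName("MyStreamingApp")
val ssc = new StreamingContext(conf, Seconds(10))

// Then submit with, e.g.:
//   spark-submit --master yarn-cluster --class MyStreamingApp app.jar
```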
09-30-2015
11:46 AM
It's possible to just use a static Executor in your code and use it to run multi-threaded operations within each function call, though this may not be efficient. If your goal is simply full utilization of cores, then make sure you have enough executors with enough cores running to use your whole cluster, and that your number of partitions is at least that large. Then each operation can be single-threaded.
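A rough sketch of that partition-based approach, assuming an existing SparkContext `sc`; the input path, executor/core counts, and the `process` function are all placeholder assumptions:

```scala
// Illustrative numbers: suppose 4 executors with 8 cores each.
val totalCores = 4 * 8

// Hypothetical single-threaded per-record function.
def process(line: String): String = line.toUpperCase

val data = sc.textFile("hdfs:///input")
// Ensure at least one partition per core, so every core has a task to run.
val partitioned =
  if (data.partitions.length < totalCores) data.repartition(totalCores)
  else data
val result = partitioned.map(process)
```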
09-28-2015
08:48 PM
With the --files option, Spark puts the file in the working directory on each executor. You are trying to point to the file using an absolute path, which is not what --files does for you. Can you use just the name "rule2.xml" and not a path? When you read the documentation for --files, see the important note at the bottom of the "Running on YARN" page. Also, do not use Resources.getResource(); just open the file with a plain Java construct like new FileInputStream("rule2.xml") or something like it. Wilfred
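A minimal sketch of what that looks like in the executor-side code, assuming the job was submitted with --files rule2.xml:

```scala
import java.io.FileInputStream

// The file shipped via --files lands in the task's working directory,
// so open it by bare name, not by an absolute path or classpath lookup.
val in = new FileInputStream("rule2.xml")
try {
  // ... parse the rules from `in` here ...
} finally {
  in.close()
}
```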
09-27-2015
08:11 AM
Done! I reinstalled a newer version of Hive (Hive 1.2.1), and now the job runs well!
09-19-2015
06:40 AM
Replication is an HDFS-level configuration. It isn't something you configure from Spark, and you don't have to worry about it from Spark. AFAIK you set a global replication factor, but you can set it per directory too. I think you want to pursue this via HDFS.
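If you do need to change it from code, a hedged sketch using the plain Hadoop FileSystem API rather than anything Spark-specific (the path and factor are made-up examples); the hdfs dfs -setrep shell command does the same job:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Set the replication factor of one existing file to 2:
fs.setReplication(new Path("/data/output/part-00000"), 2.toShort)

// Shell equivalent, recursively over a directory:
//   hdfs dfs -setrep -R 2 /data/output
```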
09-17-2015
02:09 AM
1 Kudo
I suppose you can cluster the term vectors in V·S for this purpose, to discover related terms and thus topics. This is the type of problem where you might more usually use LDA. I know you're using Mahout, but if you ever consider using Spark, there's a chapter on exactly this in our book: http://shop.oreilly.com/product/0636920035091.do
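For reference, a minimal LDA sketch with Spark MLlib's RDD-based API, assuming an existing SparkContext `sc`; the tiny hand-built corpus of (document ID, term-count vector) pairs is purely illustrative:

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Toy corpus: two documents over a three-term vocabulary.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 0.0, 3.0)),
  (1L, Vectors.dense(0.0, 2.0, 1.0))
))

val ldaModel = new LDA().setK(2).run(corpus)
// topicsMatrix is term x topic: inspect the top-weighted terms per column
// to read off each discovered topic.
println(ldaModel.topicsMatrix)
```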
09-14-2015
09:53 AM
In your Spark UI, do you see the job running with a large number of partitions (a large number of tasks)? If you have a small number of partitions, you may be loading all 70G into memory at once. It could also be that you have one huge partition with 99% of the data and lots of small ones; when Spark processes that huge partition, it will load all of it into memory. This can happen if you are mapping to a tuple, e.g. (x, y), and the key x is the same for 99% of the data. Have a look at your Spark UI to see the size of the tasks you are running: you will likely see either a small number of tasks, or one huge task and a lot of small ones.
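One quick way to check for skew outside the UI is to count records per partition. A sketch, where `rdd` stands in for whatever RDD feeds the heavy stage (an assumption, as is the existing SparkContext behind it):

```scala
// Count the records in each partition without shuffling the data.
val sizes = rdd
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()

// Print the ten largest partitions; one dominant entry means heavy skew.
sizes.sortBy(p => -p._2).take(10).foreach { case (i, n) =>
  println(s"partition $i: $n records")
}
```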