Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 5137 | 03-09-2016 01:21 AM |
| | 4330 | 03-07-2016 01:52 AM |
| | 13643 | 02-29-2016 04:40 AM |
| | 4056 | 02-22-2016 03:08 PM |
| | 5065 | 01-19-2016 02:13 PM |
12-05-2014
06:12 AM
I reproduced this. Everything else works fine; you can see the model generates a MAP of about 0.15 on Hadoop. It's just the last step where it seems to incorrectly decide the rank is insufficient. There is always a bit of a heuristic here; nothing is ever literally "singular" due to machine precision. So it could be a case of loosening or improving the heuristic. I'll have to debug a little more.
12-05-2014
06:10 AM
That's a weird one indeed. NoSuchMethodError generally means you have built your app against a different version of a library than you run against. CDH 5.2 contains Spark 1.1.0 plus a few critical upstream fixes; none of those changes should affect source or binary compatibility. You can always build against the exact CDH 5.2 artifacts anyway, but it shouldn't matter: 1.1.0 and 1.1.1 are the same from an API perspective. But here there should be no incompatibility at all, since you should not be bundling Spark libs (or Hadoop libs) with your app. Are you marking them as 'provided' dependencies?
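If you're not sure, a quick way to tell (the jar name here is just a placeholder for your application jar) is to list what actually got packaged:

```
# If Spark and Hadoop are correctly marked 'provided', none of their classes
# should appear inside your application jar.
jar tf myapp.jar | grep -c 'org/apache/spark'    # expect 0
jar tf myapp.jar | grep -c 'org/apache/hadoop'   # expect 0
```

In Maven that means <scope>provided</scope> on the Spark dependency; in sbt, % "provided".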
12-05-2014
05:22 AM
1 Kudo
That's perhaps too broad to answer here. Generally, any algorithm that is data-parallel will do well on Spark (or indeed, on MapReduce), and ones that aren't data-parallel won't. I am not familiar with any of those algorithms, but that's the question to answer.
11-28-2014
10:16 AM
You can ignore the native libraries message; it doesn't affect anything. Right, X and Y are deleted afterwards, so it may be hard to view them before that happens. The hash from IDs to ints is a consistent one, so the same string will always map to the same ID. Something funny is going on here, and it's probably subtle but simple, like an issue with how the data is read. Your comment about the IDs suggests that the data files aren't being read as intended, so maybe all of these IDs are being treated as if they are unrelated. That could somehow explain the poor performance and virtually 0 rank -- which should be all but impossible with so much data and a reasonable default rank of <100. Is it possible to send me a link to the data privately, and your config? I can take a look locally.
11-28-2014
08:36 AM
Although the result can vary a bit randomly from run to run, and it's possible you're on the border of insufficient rank, it sounds like this happens consistently? Are there any errors from the Hadoop workers? Do X/ and Y/ contain data? It sounds like the process has stopped too early. I suppose double-check that you do have the same data on HDFS. Is the config otherwise the same?
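A quick way to look (the path below is just an example; use wherever your run writes its output on HDFS):

```
# List the X/ and Y/ factor directories and their total size.
# Empty or missing directories mean the job stopped before writing them.
hadoop fs -ls /user/oryx/model/00000/X
hadoop fs -du -s -h /user/oryx/model/00000/X /user/oryx/model/00000/Y
```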
11-28-2014
08:27 AM
Hadoop still has config files for sure. They can end up wherever you want them to. I thought they were still at $HADOOP_HOME/conf in the vanilla Hadoop tarball, but I took a look at 2.5.2 and it's at $HADOOP_HOME/etc/hadoop in fact. In any event, if they're at /usr/local/hadoop/etc/hadoop in your installation, then that's what you set $HADOOP_CONF_DIR to -- just wherever they really are. This is one of Hadoop's standard environment variables. If you're up and running, then this is working. Yes, that sounds about like what you do to install Snappy. They are libs that should be present on the cluster machines.
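To confirm the native Snappy library is actually visible to Hadoop on a machine, recent Hadoop 2.x releases have a built-in check:

```
# Lists the native codecs Hadoop can load; 'snappy: true' should appear
# once the Snappy native library is installed and on the library path.
hadoop checknative -a
```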
11-27-2014
11:48 AM
First you need to figure out where your Hadoop config files are -- core-site.xml, etc. If you unpacked things in /usr/local/hadoop, then it's almost surely /usr/local/hadoop/conf. You have "etc" in your path but shouldn't, and that's the actual problem. You don't need to set all these environment variables, just "export HADOOP_CONF_DIR=..." in your shell. You don't need to modify any scripts; hadoop-env.sh won't do anything. Have you installed Snappy? You will need Snappy. I don't know whether plain vanilla Apache Hadoop is able to configure and install it for you, although it's part of Hadoop. It's much easier to use a distribution, but your second problem appears to be down to not having Snappy set up.
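For example (the directory below is just a guess; use whatever the search turns up on your machine):

```
# Find where core-site.xml actually lives under the unpacked Hadoop tree...
find /usr/local/hadoop -name core-site.xml
# ...then point HADOOP_CONF_DIR at the directory that contains it:
export HADOOP_CONF_DIR=/usr/local/hadoop/conf
```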
11-27-2014
08:47 AM
Caused by: java.lang.IllegalStateException: Not a directory: /etc/hadoop/conf

Is HADOOP_CONF_DIR set, and set to /usr/local/hadoop? That's what it's complaining about: it can't find the Hadoop config in a default location.
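A quick way to check from the shell:

```
# Is the variable set, and does it point at a directory containing the config?
echo "$HADOOP_CONF_DIR"
ls "$HADOOP_CONF_DIR"/core-site.xml
```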
11-24-2014
02:37 PM
1 Kudo
This is an option to spark-submit or pyspark. Look at the Spark docs.
11-24-2014
02:01 AM
Spark defaults to running with a local master, IIRC. You should set "--master yarn-client" to actually use YARN. I assume it's no different for pyspark vs spark-shell.
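For example (the app jar, class, and script names are only placeholders):

```
# Run an application, or the Python shell, on YARN in client mode.
spark-submit --master yarn-client --class com.example.MyApp myapp.jar
pyspark --master yarn-client
```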