Member since: 07-29-2013
Posts: 366
Kudos Received: 69
Solutions: 71
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 5137 | 03-09-2016 01:21 AM |
| | 4330 | 03-07-2016 01:52 AM |
| | 13643 | 02-29-2016 04:40 AM |
| | 4056 | 02-22-2016 03:08 PM |
| | 5065 | 01-19-2016 02:13 PM |
12-05-2014
06:12 AM
I reproduced this. Everything else works fine; you can see the model generates a MAP of about 0.15 on Hadoop. It's just the last step where it seems to incorrectly decide the rank is insufficient. There is always a bit of a heuristic here; nothing is ever literally "singular" due to machine precision. So it could be a case of loosening or improving the heuristic. I'll have to debug a little more.
12-05-2014
06:10 AM
That's a weird one indeed. NoSuchMethodError generally means you have built your app against a different version of a library than you run against. CDH 5.2 contains Spark 1.1.0 plus a few critical upstream fixes; none of those changes should affect source or binary compatibility. You can always build against the exact CDH 5.2 artifacts anyway, but it shouldn't matter: 1.1.0 and 1.1.1 are the same from an API perspective. But here there should be no incompatibility at all, since you should not be bundling Spark libs (or Hadoop libs) with your app. Are you marking them as 'provided' dependencies?
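If you're not sure, a quick way to tell (the jar name here is just a placeholder for your application jar) is to list what actually got packaged:

```
# If Spark and Hadoop are correctly marked 'provided', none of their classes
# should appear inside your application jar.
jar tf myapp.jar | grep -c 'org/apache/spark'    # expect 0
jar tf myapp.jar | grep -c 'org/apache/hadoop'   # expect 0
```

In Maven that means <scope>provided</scope> on the Spark dependency; in sbt, % "provided".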
12-05-2014
05:22 AM
1 Kudo
That's perhaps too broad to answer here. Generally, any algorithm that is data-parallel will do well on Spark (or indeed, on MapReduce), and ones that aren't data-parallel won't. I am not familiar with any of those algorithms, but that's the question to answer.
11-28-2014
10:16 AM
You can ignore the native libraries message; it doesn't affect anything. Right, X and Y are deleted afterwards, so it may be hard to view them before that happens. The hash from IDs to ints is a consistent one, so the same string will always map to the same ID. Something funny is going on here, and it's probably subtle but simple, like an issue with how the data is read. Your comment about the IDs suggests that the data files aren't being read as intended, so maybe all of these IDs are being treated as if they are unrelated. That could somehow explain the poor performance and virtually 0 rank -- which should be all but impossible with so much data and a reasonable default rank of <100. Is it possible to send me a link to the data privately, and your config? I can take a look locally.
11-28-2014
08:36 AM
Although the result can vary a bit randomly from run to run, and it's possible you're on the border of insufficient rank, it sounds like this happens consistently? Are there any errors from the Hadoop workers? Do X/ and Y/ contain data? It sounds like the process has stopped too early. I suppose double-check that you do have the same data on HDFS. Is the config otherwise the same?
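A quick way to look (the path below is just an example; use wherever your run writes its output on HDFS):

```
# List the X/ and Y/ factor directories and their total size.
# Empty or missing directories mean the job stopped before writing them.
hadoop fs -ls /user/oryx/model/00000/X
hadoop fs -du -s -h /user/oryx/model/00000/X /user/oryx/model/00000/Y
```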
11-28-2014
08:27 AM
Hadoop still has config files for sure. They can end up wherever you want them to. I thought they were still at $HADOOP_HOME/conf in the vanilla Hadoop tarball, but I took a look at 2.5.2 and it's at $HADOOP_HOME/etc/hadoop in fact. In any event, if they're at /usr/local/hadoop/etc/hadoop in your installation, then that's what you set $HADOOP_CONF_DIR to -- just wherever they really are. This is one of Hadoop's standard environment variables. If you're up and running, then this is working. Yes, that sounds about like what you do to install Snappy. They are libs that should be present on the cluster machines.
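To confirm the native Snappy library is actually visible to Hadoop on a machine, recent Hadoop 2.x releases have a built-in check:

```
# Lists the native codecs Hadoop can load; 'snappy: true' should appear
# once the Snappy native library is installed and on the library path.
hadoop checknative -a
```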
11-27-2014
11:48 AM
First you need to figure out where your Hadoop config files are -- core-site.xml, etc. If you unpacked things in /usr/local/hadoop, then it's almost surely /usr/local/hadoop/conf. You have "etc" in your path but shouldn't, and that's the actual problem. You don't need to set all these environment variables, just "export HADOOP_CONF_DIR=..." in your shell. You don't need to modify any scripts; hadoop-env.sh won't do anything. Have you installed Snappy? You will need Snappy. I don't know whether plain vanilla Apache Hadoop is able to configure and install it for you, although it's part of Hadoop. It's much easier to use a distribution, but your second problem appears to be down to not having Snappy set up.
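For example (the directory below is just a guess; use whatever the search turns up on your machine):

```
# Find where core-site.xml actually lives under the unpacked Hadoop tree...
find /usr/local/hadoop -name core-site.xml
# ...then point HADOOP_CONF_DIR at the directory that contains it:
export HADOOP_CONF_DIR=/usr/local/hadoop/conf
```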
11-27-2014
08:47 AM
Caused by: java.lang.IllegalStateException: Not a directory: /etc/hadoop/conf

Is HADOOP_CONF_DIR set, and set to /usr/local/hadoop? That's what it's complaining about: it can't find the Hadoop config in a default location.
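A quick way to check from the shell:

```
# Is the variable set, and does it point at a directory containing the config?
echo "$HADOOP_CONF_DIR"
ls "$HADOOP_CONF_DIR"/core-site.xml
```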
11-24-2014
02:37 PM
1 Kudo
This is an option to spark-submit or pyspark. Look at the Spark docs.
11-24-2014
02:01 AM
Spark defaults to running with a local master, IIRC. You should set "--master yarn-client" to actually use YARN. I assume it's no different for pyspark vs spark-shell.
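For example (the app jar, class, and script names are only placeholders):

```
# Run an application, or the Python shell, on YARN in client mode.
spark-submit --master yarn-client --class com.example.MyApp myapp.jar
pyspark --master yarn-client
```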