
Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

Hi 

 

I have been running Oryx ALS on the same input dataset, both with local (in-memory) computation and with Hadoop. In memory produces a MAP of around 0.11 and converges after more than 25 iterations; I ran this about 20 times. With Hadoop, same dataset and same parameters, the algorithm converges at iteration 2 and MAP is 0.00x (I ran it 3 times, wiping out the previous computations each time). 
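For context, MAP here is mean average precision over held-out test items per user. A minimal sketch of per-user average precision (an illustration only, not Oryx's actual evaluation code):

```python
def average_precision(recommended, relevant):
    """Average precision of a ranked recommendation list against a set of
    held-out relevant items (sketch; Oryx's own evaluator may differ)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this hit's rank
    return score / max(len(relevant), 1)

# MAP is then the mean of average_precision over all test users.
```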

 

With Hadoop computations I get this message: 

 

Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results

 

 

Any hints, please?

 

Thank you.  

1 ACCEPTED SOLUTION

Master Collaborator

The implementations are entirely separate although they do the same thing at a high level. Here the sampling process is different enough that it made a difference only in one place, even though both are sampling the same things.

 

This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.


34 REPLIES

Master Collaborator

Although the result can vary a bit randomly from run to run, and it's possible you're on the border of insufficient rank, it sounds like this happens consistently?

 

Are there any errors from the Hadoop workers? Do X/ and Y/ contain data? It sounds like the process has stopped too early.

I suppose double-check that you do have the same data on HDFS. The config is otherwise the same?

Explorer

@srowen wrote:

Although the result can vary a bit randomly from run to run, and it's possible you're on the border of insufficient rank, it sounds like this happens consistently?

 

 


This happens consistently, yes. With those particular parameters, I always get ~0.11 in memory, and always between 0.006 and 0.009 with Hadoop. The config files are the same; I just commented out the local-computation and local-data lines. The dataset is quite big: the file itself is 84 MB and has 3.6 million lines. 

Also, for the in-memory computations I wrote a tool to automate the search for factors, lambdas and alphas, so I have quite a lot of runs so far, and just one performed as badly as these Hadoop ones. And never for these parameters. 
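For concreteness, the switch described above amounts to something like this in the Oryx config file (a sketch; the key names are the ones mentioned in this thread, but the exact prefix and layout may differ in your setup):

```
# Local (in-memory) runs had these set; commenting them out
# switches the computation to Hadoop:
# model.local-computation = true
# model.local-data = true
```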

 


@srowen wrote:

 

Are there any errors from the Hadoop workers? do X/ and Y/ contain data? It sounds like the process has stopped too early.

I suppose double-check that you do have the same data on HDFS. The config is otherwise the same?


I have checked the computation layer log - in the console where I launched it - and the Hadoop job log. There were no errors anywhere. I do have a warning in the console: unable to find Hadoop native libraries, using built-in Java classes (I've Googled for a fix and will attend to that at some point).  

 

X and Y do not contain any data, as they get deleted, according to the computation log:

 

 

Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results

 

Apart from these two missing folders, the rest of the artifacts get generated. I have compared the known-items file from an in-memory run with a Hadoop run; the in-memory one has around 2000 extra lines and, at first glance, not so many negative user ids. But the user ids get generated each time by the algorithm (meaning these new numbers internally replace my user ids), so I should not expect them to be the same - is that correct?  
 

Master Collaborator

You can ignore the native libraries message. It doesn't affect anything.

Right, X and Y are deleted after. It may be hard to view them before that happens.

The hash from IDs to ints is a consistent one, so the same string will always map to the same ID.

 

Something funny is going on here and it's probably subtle but simple, like an issue with how the data is read. Your comment about the IDs kind of suggests that the data files aren't being read as intended, so maybe all of these IDs are being treated quite differently, as if they are unrelated. That could somehow explain poor performance and virtually 0 rank -- which should be all but impossible with so much data and a reasonable default rank of <100.

 

Is it possible to send me a link to the data privately, and your config? I can take a look locally.

Explorer

Hi. Thank you and apologies for my delay in replying, I am being shared between projects... 

 

I have tried a few other datasets, smaller ones, and the issue is present for them as well. For this last small one, MAP in memory is ~0.14; with Hadoop it is 0.06. It does look like something is wrong with my Hadoop installation; however, I can't figure out why, as the steps are quite simple.  


@srowen wrote:

You can ignore the native libraries message. It doesn't affect anything.

Right, X and Y are deleted after. It may be hard to view them before that happens.

The hash from IDs to ints is a consistent one, so the same string will always map to the same ID.

 


Just a side, low-priority question here: why do the user ids get generated, but the item ids don't? My understanding was that the input data constraints are: user ids should be unique long numbers, item ids strings, and ratings floats. This made me think the original user ids can be reused, but item ids have to be generated. 

 


@srowen wrote:

 

Something funny is going on here and it's probably subtle but simple, like an issue with how the data is read. Your comment about the IDs kind of suggests that the data files aren't being read as intended, so maybe all of these IDs are being treated quite differently as if they are unrelated. 


I've redone the Snappy install too, just in case I missed something the first time. I was thinking perhaps the compression is done with one version and decompression with a different one, hence the "data read" issue. So, is Snappy a dependency of Oryx, and do I perhaps need to rebuild Oryx against the version I have installed in Hadoop?

 


@srowen wrote:

Right, X and Y are deleted after. It may be hard to view them before that happens.

 

Is it possible to send me a link to the data privately, and your config? I can take a look locally.


I have changed Oryx's source code so that it does not wipe out X and Y even when the matrix is subpar (commented out some lines in ALSDistributedGenerationRunner).

 

I have uploaded the input data and run results here: https://drive.google.com/folderview?id=0Bwd5INm6b7z4MENMcWtmQkNHRHM&usp=sharing . I am not concerned about privacy issues, as the data is already anonymized; those ids don't really mean anything. 

I have created 2 folders, one with the in memory computation, the other for Hadoop, both computations for the same dataset.  

 

A few questions:

- my user ids are Int64, i.e. 64-bit signed integers. Could this cause problems? Next on my list is to rearrange them to start from 1.

- the results I have included are for a test fraction of 0.25, so the output files will differ a lot due to random splitting (I imagine). Would it be easier for you if I ran the computations without a test fraction?

- would it be even easier if I ran the computations with RandomUtils.useTestSeed()? And would I have to instruct the reducers to do this too?

 

Thank you again for being willing to look at this! 

 

Explorer

... I looked more closely at Oryx's source code and it does seem to depend on Snappy 1.0.4.1. The version I have installed in Hadoop is 1.1.2. Could this be an issue, with one party compressing with one version and the other decompressing with a different one, if they added breaking changes (though I'd expect those behind a jump to 2.x.x)? 

Explorer

Hi Sean

 

I have run more experiments and I think the problem is somewhere in Oryx's code, not in my Hadoop installation. Because:

 

I booted up the latest version of the Cloudera QuickStart VM and ran:

- in memory computation => MAP 0.15

- Hadoop computation => insufficient rank

 

I also ran a Hadoop computation of the AudioScrobbler dataset, on my installation of Hadoop this time, and this produced X and Y with sufficient rank.

 

So, to conclude, there seems to be an issue strictly related to the format of my files... any hints? The file is UTF-8 encoded and line endings are LF... I will try rebuilding the user ids (they are currently very large numbers).

 

 

 

Master Collaborator

What do you mean that item IDs don't get generated? User and item IDs can be strings.

 

Snappy is required, but no particular version is. I don't know of any bugs in Snappy. Oryx does not depend on Snappy directly but simply requires that Hadoop have Snappy codecs available. However, it does end up embedding the hadoop-client directly to access HDFS, and maybe there is the possibility of a version mismatch there.  

 

Did you build the binary to match your version of Hadoop? That's the safest thing. What are you using?

 

IDs don't matter. If they are strings representing long values they are used directly (i.e. "123" hashes to 123).
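That behavior can be sketched as follows (an illustration only: numeric strings pass through unchanged, and the md5-based fallback for non-numeric IDs is made up here, not Oryx's actual hash function):

```python
import hashlib

def id_to_long(token: str) -> int:
    """Map an ID string to a long. Numeric strings are used directly
    ("123" -> 123); anything else gets a deterministic hash (the md5-based
    fallback below is illustrative, not Oryx's exact implementation)."""
    try:
        return int(token)
    except ValueError:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big", signed=True)
```

The key property is determinism: the same string always maps to the same long, across runs and across workers.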

Random splitting does change results a bit from run to run but shouldn't result in a consistent difference.

 

OK, I'll try to get time to try it myself.

Explorer

@srowen wrote:

Did you build the binary to match your version of Hadoop? that's the safest thing. What are you using?


The pom.xml in Oryx says it builds for 2.5.1, and that's exactly what I have on my machine (<hadoop.version>2.5.1</hadoop.version> in the pom).

Nevertheless, I just ran mvn install -Dhadoop.version=2.5.1 now (I did not specify the version before, presuming the one from the pom file is used). I also left the tests to run; all passed.

 

After that, things performed exactly as before, I am afraid: X and Y do not have sufficient rank (and all is fine with local computation). 

 

 

The Cloudera QuickStart VM, where I ran some other tests, has 2.5.0, so there was a mismatch there; I will build Oryx for 2.5.0 and retry there as well.

 


@srowen wrote:

 

IDs don't matter. If they are strings representing long values they are used directly (i.e. "123" hashes to 123).

 


Indeed, I have now changed the ids to more sensible numbers, but that made no difference. 

 


@srowen wrote:

 

OK, I'll try to get time to try it myself.


Thank you for that. If there is anything I can do to help, I will be online and reachable by email at christina dot androne at gmail.

 

 

Explorer

Hi Sean

 

If I may summarize my findings so far, hopefully they will help pinpoint the problem (I tried a few other things since yesterday).

 

So far I have ruled out the following:

- faulty Hadoop and Snappy installation on my machine: the same behavior happens on the Cloudera QuickStart VM, and the AudioScrobbler computation with Hadoop works fine on both systems (by which I mean my machine and the VM, which I'll call the CVM);

- mismatch between Oryx's embedded hadoop-client version and the installed version of Hadoop: I have built Oryx with the corresponding versions and the issue is still present (my machine has 2.5.1, the CVM has 2.5.0);

- data being on the border of insufficient rank: the in-memory computation always produces X and Y with sufficient rank, and the Hadoop computation always produces the opposite. Given how many trials I ran, I'd expect the situation to be reversed at least once;

- faulty build of Oryx: I ran some computations using the 1.0.0 jar from the Releases page on GitHub, but again no improvement;

- reducer memory issues: I tried a few runs with computation-layer.worker-high-memory-factor=5, same thing;

- test-set-fraction issues that might come up only with Hadoop: I get the same faulty behavior when I don't set a test set fraction;

- data size issues: I ran some tests with an even more reduced version of my dataset, a bit smaller than the AudioScrobbler one. No improvement, I am afraid.

 

Based on this, I can only conclude there is something faulty with the file itself and how the data is structured (as AudioScrobbler works fine). What do you think? Any hints on what to do next?

 

Thank you.

 

 

 

Master Collaborator

I reproduced this. Everything else works fine; you can see the model generates a MAP of about 0.15 on Hadoop. It's just the last step, where it seems to incorrectly decide the rank is insufficient. There is always a bit of a heuristic here; nothing is ever literally "singular" due to machine precision. So it could be a case of loosening or improving the heuristic. I'll have to debug a little more.

Explorer

Hi

 

Thank you so much for looking into this!

If it's of any help with the debugging: I managed to reproduce this with an even smaller subset of the data, and I have uploaded that file to the drive. Also, X and Y, before getting deleted, are indeed made up of much smaller numbers when run with Hadoop.

 

Master Collaborator

Is this data synthetically generated? I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.

Explorer

It is not; this last, smallest dataset is a six-month trace of user clicks through the system (well, a quarter of searches, just to have a smaller and faster test).

It does contain just the users with at least 11 different product searches, though, because when adding users with fewer clicks the datasets get a lot bigger and I get the JVM exception where too much time is spent garbage collecting. 

 

I have a bunch of others that I have tried, without this constraint on the minimum number of searched items, e.g. a one-month trace of all users and all their clicks; that one again does not work with Hadoop (but in memory had a ~0.12 MAP).

Explorer

I have uploaded to Google Drive the one which traces all user clicks, including the one-time visitors, if it's of any help. 

Explorer

@srowen wrote:

I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.


Say that were true: I should then get the insufficient rank at least once when running the in-memory computation, shouldn't I? Given that I've been chasing this for a few days now, I have run the in-memory computation over and over again, and it did not happen.

 

If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain way more very small numbers than the in-memory ones (25000 as opposed to 76, on the 11_quarter.csv dataset). So maybe they are not singular, but the determinant is under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may speculate: maybe serializing/deserializing very small floats is somehow bust and they get passed on smaller and smaller from one generation to the next?

Explorer

@christinan wrote:

If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain way more very small numbers than the in-memory ones (25000 as opposed to 76, on the 11_quarter.csv dataset). So maybe they are not singular, but the determinant is under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may speculate: maybe serializing/deserializing very small floats is somehow bust and they get passed on smaller and smaller from one generation to the next?


Sorry, maybe I spoke too soon. For Hadoop, it looks like the first column is close to 0 for both X and Y, so...

Master Collaborator

You are right that it's unlikely the earlier computations would work if the data were low rank. OK, synthetic data is ruled out.

It's not quite X or Y that is singular or nonsingular; it's X'*X and Y'*Y. Small absolute values in the matrices are normal.
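The idea can be sketched as a condition-number check on the Gram matrix (an illustration only; the threshold and the exact test Oryx applies are assumptions):

```python
import numpy as np

def has_sufficient_rank(X: np.ndarray, cond_threshold: float = 1e8) -> bool:
    """Decide whether X'X is "non-singular enough" (sketch).

    The test applies to the k-by-k Gram matrix X'X, not to X itself, so
    small absolute entries in X are fine; what matters is whether some
    feature dimension has (nearly) collapsed."""
    gram = X.T @ X
    return bool(np.linalg.cond(gram) < cond_threshold)
```

With a reasonable rank k and lots of data, a healthy model passes easily; a collapsed column makes the Gram matrix near-singular and the check fail.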

Master Collaborator

Strange: I do indeed get much different answers from the Hadoop version, and they don't look quite right. The first row and column are very small, and there's no good reason for that. I'll keep digging to see where things go funny. The fact that MAP is good suggests the model is good during iteration but something funny happens at the end.
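A quick way to see a collapsed row or column like the one described above (a diagnostic sketch, assuming the factor files have been loaded into a NumPy array):

```python
import numpy as np

def column_norms(M: np.ndarray) -> np.ndarray:
    """L2 norm of each column; a near-zero entry flags a dead feature
    dimension like the one observed in the Hadoop-built factors."""
    return np.linalg.norm(M, axis=0)
```

Running this on X (and on X.T for rows) makes the problem visible at a glance: healthy dimensions have norms of similar magnitude, while a collapsed one sits near zero.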

Explorer

Yes, exactly, and it happens too deterministically to be random, so to speak. On my end I've tried all sorts of things in an attempt to corner the cause and take as many variables out of the equation as possible: I disabled compression in mapred-site.xml; I changed the input data to use quotes, the ratings from x (integers) to x.01, LF to CRLF, UTF-8 with BOM and without it, etc. I even changed the dataset itself by taking different, quite distinct slices of the data; the one including just users with more searches looks a lot different from the one with all users and all their searches. I also tried reducing the number of factors. None of this had any effect.

 

But on the other hand, Movielens and Audio Scrobbler work fine, so a weird one... I'm very grateful you're investigating this, I'm certainly running out of things to try.

 

The value 0.15 that you see in the last iteration is what I usually get with the local computation for the "11" datasets. 

 

Master Collaborator

I'm certain it's nothing to do with the input itself. It looks fine, and that type of problem would look different.
