Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

Hi 

 

I have been running Oryx ALS on the same input dataset, both with local computation and with Hadoop. In memory it produces a MAP of around 0.11 and converges after more than 25 iterations; I have run this about 20 times. With Hadoop, with the same dataset and the same parameters, the algorithm converges at iteration 2 and the MAP is 0.00x (I ran it 3 times, wiping out the previous computations each time). 

 

With the Hadoop computation I get this message: 

 

Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results

 

 

Any hints, please?

 

Thank you.  

1 ACCEPTED SOLUTION

Master Collaborator

The implementations are entirely separate, although they do the same thing at a high level. Here the sampling process is different enough that it made a difference in just one place, even though both are sampling the same things.

 

This distributed/non-distributed distinction is historical; there are really two codebases here. This won't be carried forward in newer versions.
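
As a toy illustration of that kind of divergence (this is not Oryx code): two samplers can target the same sampling rate and still keep different subsets when their random streams differ.

import java.util.Random;

public class SamplingDivergence {
  public static void main(String[] args) {
    // Stand-ins for the in-memory and Hadoop code paths: same target rate,
    // independent random streams.
    Random inMemory = new Random(1L);
    Random distributed = new Random(2L);
    double rate = 0.5;
    for (int row = 0; row < 8; row++) {
      boolean keptLocally = inMemory.nextDouble() < rate;
      boolean keptOnHadoop = distributed.nextDouble() < rate;
      System.out.println("row " + row + ": local=" + keptLocally + ", hadoop=" + keptOnHadoop);
    }
  }
}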


34 REPLIES

Master Collaborator

Although the result can vary a bit randomly from run to run, and it's possible you're on the border of insufficient rank, it sounds like this happens consistently?

 

Are there any errors from the Hadoop workers? Do X/ and Y/ contain data? It sounds like the process has stopped too early.

Also double-check that you do have the same data on HDFS. Is the config otherwise the same?
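
If it helps, here is a minimal sketch of a cheap first check that both runs see the same bytes, using the standard hadoop-client FileSystem API (the paths and namenode URI are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareInputCopies {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholders: the local copy used in-memory, and the copy on HDFS
    Path localCopy = new Path("file:///data/input.csv");
    Path hdfsCopy = new Path("hdfs://namenode:8020/user/oryx/input.csv");

    FileSystem localFs = FileSystem.get(URI.create("file:///"), conf);
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    FileStatus local = localFs.getFileStatus(localCopy);
    FileStatus remote = hdfs.getFileStatus(hdfsCopy);
    // Equal byte counts don't prove the files are identical, but a mismatch
    // immediately shows the two runs are not reading the same data.
    System.out.println("local bytes: " + local.getLen());
    System.out.println("hdfs bytes:  " + remote.getLen());
  }
}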

Explorer

@srowen wrote:

Although the result can vary a bit randomly from run to run, and it's possible you're on the border of insufficient rank, it sounds like this happens consistently?

 

 


This happens consistently, yes. With those particular parameters I always get ~0.11 in memory, and always between 0.006 and 0.009 with Hadoop. The config files are the same; I just commented out the local-computation and local-data lines. The dataset is quite big: the file itself is 84 MB and has 3.6 million lines. 

Also, for the in-memory computations I wrote a tool to automate the search for factors, lambdas and alphas, so I have quite a lot of runs so far, and just one performed as badly as these Hadoop ones, and never for these parameters. 

 


@srowen wrote:

 

Are there any errors from the Hadoop workers? Do X/ and Y/ contain data? It sounds like the process has stopped too early.

Also double-check that you do have the same data on HDFS. Is the config otherwise the same?


I have checked the computation layer log (in the console where I launched it) and the Hadoop job log. There were no errors anywhere. I do have a warning in the console about being unable to load the native Hadoop libraries, falling back to the built-in Java classes (I've Googled for a fix and will attend to that at some point).  

 

X and Y do not contain any data, as they get deleted, according to the computation log:

 

 

Fri Nov 28 15:43:09 GMT 2014 INFO Loading X and Y to test whether they have sufficient rank
Fri Nov 28 15:43:14 GMT 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Fri Nov 28 15:43:14 GMT 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results

 

Apart from these two missing folders, the rest of the artifacts get generated. I have compared the known-items file from an in-memory run with one from a Hadoop run: the in-memory one has around 2,000 extra lines and, at first glance, not as many negative user ids. But the user ids get generated each time by the algorithm (meaning these new numbers internally replace my user ids), so I should not expect them to be the same, is that correct?  
 

Master Collaborator

You can ignore the native libraries message. It doesn't affect anything.

Right, X and Y are deleted after. It may be hard to view them before that happens.

The hash from IDs to ints is a consistent one, so the same string will always map to the same ID.

 

Something funny is going on here, and it's probably subtle but simple, like an issue with how the data is read. Your comment about the IDs suggests that the data files aren't being read as intended, so maybe all of these IDs are being treated as if they are unrelated. That could explain both the poor performance and the virtually zero rank -- which should be all but impossible with so much data and a reasonable default rank of <100.
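
To make that concrete, here is a sketch of how one input line should tokenize, assuming the usual comma-separated userID,itemID[,strength] layout (the values are invented):

public class ParseSketch {
  public static void main(String[] args) {
    // Hypothetical input line, made up for illustration
    String line = "12345,6789,2.5";
    String[] tokens = line.split(",");
    String userId = tokens[0];   // "12345"
    String itemId = tokens[1];   // "6789"
    // Treat a missing third field as an implicit strength of 1
    float strength = tokens.length > 2 ? Float.parseFloat(tokens[2]) : 1.0f;
    System.out.println(userId + " / " + itemId + " / " + strength);
    // With a wrong delimiter or encoding, tokens[0] would be the whole line,
    // and every "user ID" in the file would look brand new and unrelated.
  }
}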

 

Is it possible to send me a link to the data privately, and your config? I can take a look locally.

Explorer

Hi. Thank you, and apologies for my delay in replying; I am being shared between projects... 

 

I have tried a few other, smaller datasets, and the issue is present for them as well. For this last small one, the MAP in memory is ~0.14, and with Hadoop it is 0.06. It does look like something is wrong with my Hadoop installation; however, I can't figure out what it would be, as the setup steps are quite simple.  


@srowen wrote:

You can ignore the native libraries message. It doesn't affect anything.

Right, X and Y are deleted after. It may be hard to view them before that happens.

The hash from IDs to ints is a consistent one, so the same string will always map to the same ID.

 


Just a side, low-priority question here: why do the user ids get generated, but the item ids don't? My understanding was that the input data constraints are: user ids should be unique long numbers, item ids strings, and ratings floats. This made me think the original user ids could be reused, but item ids would have to be generated. 

 


@srowen wrote:

 

Something funny is going on here, and it's probably subtle but simple, like an issue with how the data is read. Your comment about the IDs suggests that the data files aren't being read as intended, so maybe all of these IDs are being treated as if they are unrelated. 


I've redone the Snappy install too, just in case I missed something the first time. I was thinking that perhaps the compression happens with one version and the decompression with a different one, hence the "data read" issue. Is Snappy a dependency of Oryx, and do I perhaps need to rebuild Oryx against the version I have installed in Hadoop?

 


@srowen wrote:

Right, X and Y are deleted after. It may be hard to view them before that happens.

 

Is it possible to send me a link to the data privately, and your config? I can take a look locally.


I have changed Oryx's source code so that it does not wipe out X and Y even when the matrix lacks sufficient rank (by commenting out some lines in ALSDistributedGenerationRunner).

 

I have uploaded the input data and run results here: https://drive.google.com/folderview?id=0Bwd5INm6b7z4MENMcWtmQkNHRHM&usp=sharing . I am not concerned about privacy issues, as the data is already anonymized; those ids don't really mean anything. 

I have created 2 folders, one with the in-memory computation and the other with the Hadoop computation, both for the same dataset.  

 

A few questions:

- my user ids are Int64, i.e. 64-bit signed integers. Could this cause problems? Next on my list is to renumber them starting from 1.

- the results I have included are for a test fraction of 0.25, so the output files will differ a lot due to the random splitting (I imagine). Would it be easier for you if I ran the computations without a test fraction?

- would it be even easier if I ran the computations with RandomUtils.useTestSeed()? Would I have to instruct the reducers to do this too?

 

Thank you again for being willing to look at this! 

 

Explorer

... I looked more closely at Oryx's source code, and it does seem to depend on Snappy 1.0.4.1. The version I have installed in Hadoop is 1.1.2. Could this be an issue, with one party compressing with one version and the other decompressing with another, if breaking changes were added (though I'd expect those only with a jump to 2.x.x)? 

Explorer

Hi Sean

 

I have run more experiments, and I think the problem is somewhere in Oryx's code rather than in my Hadoop installation, because:

 

I booted up the latest version of the Cloudera QuickStart VM and ran:

- in-memory computation => MAP 0.15

- Hadoop computation => insufficient rank

 

I also ran a Hadoop computation on the AudioScrobbler dataset, this time on my own installation of Hadoop, and it produced X and Y with sufficient rank.

 

So, to conclude, there seems to be an issue strictly related to the format of my files... any hints? The file is UTF-8 encoded and the line endings are LF... I will try rebuilding the user ids (they are currently very large numbers).
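
In case it helps, this is the kind of quick sanity check I plan to run over the file (a sketch; the expectation of 2 or 3 comma-separated fields per line is my assumption about the format):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class InputFileSanityCheck {
  public static void main(String[] args) throws Exception {
    try (BufferedReader in =
             Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
      long lineNo = 0;
      String line;
      while ((line = in.readLine()) != null) {
        lineNo++;
        String[] fields = line.split(",", -1);
        boolean ok = fields.length == 2 || fields.length == 3;
        if (ok && fields.length == 3) {
          try {
            Float.parseFloat(fields[2].trim());  // the rating must parse
          } catch (NumberFormatException e) {
            ok = false;
          }
        }
        if (!ok) {
          System.out.println("suspicious line " + lineNo + ": " + line);
        }
      }
      System.out.println(lineNo + " lines checked");
    }
  }
}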

 

 

 

Master Collaborator

What do you mean that item IDs don't get generated? User and item IDs can be strings.

 

Snappy is required, but no particular version is. I don't know of any bugs in Snappy. Oryx does not depend on Snappy directly; it simply requires that Hadoop have the Snappy codecs available. However, it does end up embedding the hadoop-client directly to access HDFS, and maybe there is a possibility of a version mismatch there.  

 

Did you build the binary to match your version of Hadoop? That's the safest thing. What version are you using?

 

IDs don't matter. If they are strings representing long values they are used directly (i.e. "123" hashes to 123).
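
A minimal sketch of that mapping, as an illustration only (this is not Oryx's actual implementation):

public class IdMapping {
  // Numeric strings map to themselves; anything else gets a deterministic
  // stand-in hash, so the same input ID always yields the same internal ID.
  static long toInternalId(String id) {
    try {
      return Long.parseLong(id);   // "123" -> 123, used directly
    } catch (NumberFormatException e) {
      return id.hashCode();        // stand-in for the real hash function
    }
  }

  public static void main(String[] args) {
    System.out.println(toInternalId("123"));       // prints 123
    System.out.println(toInternalId("item-678"));  // same value every run
  }
}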

Random splitting does change results a bit from run to run but shouldn't result in a consistent difference.

 

OK, I'll try to get time to try it myself.

Explorer

@srowen wrote:

Did you build the binary to match your version of Hadoop? That's the safest thing.


pom.xml in Oryx says it builds for 2.5.1, and that's exactly what I have on my machine (<hadoop.version>2.5.1</hadoop.version> in the pom).

Nevertheless, I just ran mvn install -Dhadoop.version=2.5.1 (I had not specified the version before, presuming the one from the pom file would be used). I also left the tests enabled, and all passed.

 

After that, things performed exactly as before, I am afraid: X and Y do not have sufficient rank (and all is fine with local computation). 

 

 

The Cloudera QuickStart VM, where I ran some other tests, has 2.5.0, so there was a mismatch there; I will build Oryx for 2.5.0 and retry there as well.

 


@srowen wrote:

 

IDs don't matter. If they are strings representing long values they are used directly (i.e. "123" hashes to 123).

 


Indeed. I have now changed the ids to more sensible numbers, but that made no difference. 

 


@srowen wrote:

 

OK, I'll try to get time to try it myself.


Thank you for that. If there is anything I can do to help, I will be online and reachable by email at christina dot androne at gmail.

 

 

Explorer

Hi Sean

 

If I may, let me summarize my findings so far; hopefully they will help pinpoint the problem (I have tried a few other things since yesterday).

 

So far I have ruled out the following:

- a faulty Hadoop and Snappy installation on my machine: the same behavior happens on the Cloudera QuickStart VM (I'll call it the CVM), and the AudioScrobbler computation with Hadoop works fine on both systems (my machine and the VM);

- a mismatch between the hadoop-client version embedded in Oryx and the installed version of Hadoop: I have built Oryx with the corresponding versions and the issue is still present (my machine has 2.5.1, the CVM has 2.5.0);

- the data being on the border of insufficient rank: the in-memory computation always produces X and Y with sufficient rank, and the Hadoop computation always produces the opposite. Given how many trials I have run, I'd expect the situation to be reversed at least once;

- a faulty build of Oryx: I ran some computations using the 1.0.0 jar from the Releases page on GitHub, but again no improvement;

- reducer memory issues: I tried a few runs with computation-layer.worker-high-memory-factor=5; same thing;

- test-set-fraction issues that might come up only with Hadoop: I see the same faulty behavior when I don't set a test set fraction;

- data size issues: I ran some tests with an even smaller version of my dataset, a bit smaller than the AudioScrobbler one. No improvement, I am afraid.

 

Based on this, I can only conclude that there is something faulty with the file itself or how the data is structured (as AudioScrobbler works fine). What do you think? Any hints on what to do next?

 

Thank you.