
Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11


Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

I reproduced this. Everything else works fine; you can see the model generates a MAP of about 0.15 on Hadoop. It's just the last step where it seems to incorrectly decide the rank is insufficient. There is always a bit of heuristic here; nothing is ever literally "singular" due to machine precision. So it could be a case of loosening or improving the heuristic. I'll have to debug a little more.

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

Hi,

Thank you so much for looking into this!

If it's of any help with the debugging, I managed to reproduce this with an even smaller subset of the data, and I have uploaded the file to the drive. I also noticed that X and Y, before they get deleted, are indeed made up of much smaller numbers when run with Hadoop.

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

Is this data synthetically generated? I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

It is not. The latest, smallest dataset is a six-month trace of user clicks through the system (well, a quarter of the searches, just to have a smaller and faster test).

It does contain only the users with at least 11 different product searches, though, because when I add users with fewer clicks the datasets get a lot bigger and I get the JVM exception where too much time is spent garbage collecting.

I have tried a bunch of others without this constraint on the minimum number of searched items, e.g. a one-month trace of all users and all their clicks. That one again does not work with Hadoop (but in memory it had a MAP of ~0.12).

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

I have uploaded to Google Drive the one that traces all user clicks, including the one-time visitors, in case it's of any help.


Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

@srowen wrote:

I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.


If that were true, I should get insufficient rank at least once when running the in-memory computation, shouldn't I? I have been chasing this for a few days now, running the in-memory computation over and over, and it never happened.

If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain far more very small numbers than the in-memory ones (25000 as opposed to 76 on the 11_quarter.csv dataset). So maybe they are not singular, and the determinant is under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may, maybe serializing/deserializing very small floats is somehow broken and they get smaller and smaller as they are passed from one generation to the next?

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

@christinan wrote:

If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain far more very small numbers than the in-memory ones (25000 as opposed to 76 on the 11_quarter.csv dataset). So maybe they are not singular, and the determinant is under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may, maybe serializing/deserializing very small floats is somehow broken and they get smaller and smaller as they are passed from one generation to the next?


Sorry, maybe I spoke too soon. For Hadoop, it looks like the first column is close to 0 for both X and Y, so...

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

You are right that it's unlikely that the earlier computations would work if the data were low rank. OK, synthetic data is ruled out.

It's not quite X or Y that is singular or nonsingular, it's X'*X and Y'*Y. Small absolute values in the matrices are normal.
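To make the distinction concrete, here is a small sketch in plain Python. It is not Oryx code: the helper names and the threshold value are hypothetical, and how Oryx actually applies SINGULARITY_THRESHOLD is not shown in this thread. It only illustrates that scaling X by s scales the determinant of X'*X by s^(2k) for k columns, so an absolute determinant test can flag a matrix whose columns are perfectly independent, while a scale-free ratio does not change.

```python
def gram(m):
    """Return M' * M for a matrix given as a list of rows."""
    k = len(m[0])
    return [[sum(row[i] * row[j] for row in m) for j in range(k)]
            for i in range(k)]


def det(a):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]  # work on a copy
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d


def relative_det(g):
    """det(G) divided by the product of G's diagonal; unchanged when X is rescaled."""
    diag = 1.0
    for i in range(len(g)):
        diag *= g[i][i]
    return 0.0 if diag == 0.0 else det(g) / diag


# Hypothetical value, chosen only for this illustration.
HYPOTHETICAL_THRESHOLD = 1e-5

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # independent columns
x_small = [[0.01 * v for v in row] for row in x]     # same matrix, scaled down
x_dep = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]         # second column = 2 * first

for name, m in (("x", x), ("x_small", x_small), ("x_dep", x_dep)):
    g = gram(m)
    print(name, det(g), relative_det(g), det(g) < HYPOTHETICAL_THRESHOLD)
```

Here x and x_small have the same relative value, but only x_small falls under the absolute threshold; x_dep is genuinely rank deficient and gives zero under both measures.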

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Master Collaborator

Strange, I do indeed get much different answers on the Hadoop version and they don't look quite right. The first row and column are very small and there's no good reason for that. I'll keep digging to see where things go funny. The fact that MAP is good suggests that the model is good during iteration but something funny happens at the end.
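For anyone following along, the link between a near-zero first column in X (or Y) and a very small first row and column in X'*X can be checked with a few lines of plain Python. This is a sketch on made-up numbers, not the data from this thread, and the function names and tolerance are hypothetical.

```python
import math


def gram(m):
    """Return M' * M for a matrix given as a list of rows."""
    k = len(m[0])
    return [[sum(row[i] * row[j] for row in m) for j in range(k)]
            for i in range(k)]


def near_zero_columns(m, tol):
    """Indices of columns whose Euclidean norm is below tol (tol is an assumption)."""
    k = len(m[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in m)) for j in range(k)]
    return [j for j, n in enumerate(norms) if n < tol]


# Made-up factor matrix whose first column is close to 0.
x = [
    [1e-6, 1.0, 2.0],
    [2e-6, 0.5, 1.0],
    [1e-6, 3.0, 0.5],
]

g = gram(x)
print(near_zero_columns(x, 1e-3))  # columns flagged as near zero
for row in g:
    print(row)                      # first row and first column are tiny
```

Every entry in the first row and column of X'*X is a dot product with the near-zero column, so they all come out tiny, which pushes the matrix toward singular even when the remaining columns are fine.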

Re: Oryx ALS: Hadoop computation yields MAP 0.00x, but in memory produces 0.11

Explorer

Yes, exactly, and it happens too deterministically to be random, so to speak. On my end I've tried all sorts of things in an attempt to corner the cause and take as many things out of the equation as possible: I disabled compression in mapreduce-site.xml, changed the input data to use quotes, changed the ratings from integers x to x.01, switched LF to CRLF, tried UTF-8 with and without a BOM, etc. I even changed the dataset itself by taking different slices of the data, and quite different ones: the slice that includes only users with more searches looks a lot different from the one with all users and all their searches. I also tried reducing the number of factors. None of this had any effect.

On the other hand, MovieLens and Audioscrobbler work fine, so it's a weird one... I'm very grateful you're investigating this; I'm certainly running out of things to try.

The value 0.15 that you see in the last iteration is what I usually get with the local computation for the "11" datasets. 

 
