Member since: 11-27-2014
Posts: 32
Kudos Received: 0
Solutions: 0
12-07-2014
02:22 PM
No, the other dataset - I uploaded it to Google Drive a few days back; it's called cleaned_taste_preferences_rated_last_month.csv. It's a one-month trace of production searches; when I go live with the recommender, I will train it on a similar dataset, but covering six months. The ones I used to illustrate the initial problem with the rank, with names containing _11, contain just the users with at least 11 different searches; I used them only for development speed, as they are way smaller and will most likely converge.
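For context, deriving an "_11"-style subset from the full trace amounts to something like the sketch below, assuming a simple user,item,value CSV layout (the real preprocessing and column layout may differ):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: keep only users with at least 11 distinct searched items.
    // Assumes lines of the form "user,item,value"; adjust the parsing to the real layout.
    public class FilterActiveUsers {
      public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);

        // First pass: count distinct items per user
        Map<String,Set<String>> itemsPerUser = new HashMap<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
          if (line.trim().isEmpty()) continue;
          String[] cols = line.split(",");
          itemsPerUser.computeIfAbsent(cols[0], u -> new HashSet<>()).add(cols[1]);
        }

        // Second pass: write only the lines belonging to sufficiently active users
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
          for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) continue;
            String user = line.split(",")[0];
            if (itemsPerUser.get(user).size() >= 11) {
              w.write(line);
              w.newLine();
            }
          }
        }
      }
    }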
12-07-2014
12:31 PM
I think something is still wrong. When using cleaned_taste_preferences_rated_last_month.csv, Hadoop converges at iteration 2, with a MAP of 0.00x. The in-memory computation converges a lot later, and MAP is ~0.11. This happened every time. Do you see the same thing?
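For anyone comparing the numbers: MAP here is mean average precision over held-out items. A minimal sketch of a standard MAP computation, not necessarily identical to how Oryx estimates it internally:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch: mean average precision over a set of users, given ranked
    // recommendations and the held-out ("relevant") items for each user.
    public class MeanAveragePrecision {

      static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
          return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
          if (relevant.contains(ranked.get(i))) {
            hits++;
            sum += (double) hits / (i + 1);  // precision at the rank of this hit
          }
        }
        return sum / relevant.size();
      }

      static double map(Map<String,List<String>> recs, Map<String,Set<String>> heldOut) {
        double total = 0.0;
        for (Map.Entry<String,List<String>> e : recs.entrySet()) {
          total += averagePrecision(e.getValue(),
                                    heldOut.getOrDefault(e.getKey(), Collections.emptySet()));
        }
        return total / recs.size();
      }
    }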
12-07-2014
09:58 AM
Thank you so much for fixing this. I ran a Hadoop and a local computation and they were both fine this time: the model is preserved, and the MAP estimate for the last iteration is comparable to the local computation one (slightly smaller, presumably due to randomness). I will continue to run a few more tests. I think there is still a slight mismatch, as with Hadoop none of the datasets converged later than iteration 8, but in memory it takes a lot longer to converge; for the bigger datasets it needs close to 50 iterations. But I will repeat all my tests with the new version and report back.
12-05-2014
03:20 PM
Yes, exactly, and it happens too deterministically to be random, so to speak. On my end I've tried all sorts of things to corner the cause and take as many variables out of the equation as possible:
- disabled compression in mapred-site.xml (sketch below);
- changed the input data to use quotes, changed the ratings from integers (x) to x.01, LF to CRLF, UTF-8 with and without BOM, etc.;
- changed the dataset itself, by taking different slices of the data, and quite different ones: the slice with just the users with more searches looks a lot different from the one with all users and all their searches;
- tried reducing the number of factors.
None of this had any effect. On the other hand, MovieLens and Audio Scrobbler work fine, so it's a weird one... I'm very grateful you're investigating this; I'm certainly running out of things to try. The 0.15 value that you see in the last iteration is what I usually get from the local computation for the "11" datasets.
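For reference, the compression change amounts to something like this in mapred-site.xml; a sketch with the standard Hadoop 2.x property names, only the relevant fragment shown:

    <!-- disable compression of intermediate (map) output -->
    <property>
      <name>mapreduce.map.output.compress</name>
      <value>false</value>
    </property>
    <!-- disable compression of final job output -->
    <property>
      <name>mapreduce.output.fileoutputformat.compress</name>
      <value>false</value>
    </property>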
12-05-2014
10:42 AM
@christinan wrote: If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain way more very small numbers than the in-memory ones (25000 as opposed to 76, on the 11_quarter.csv dataset), so maybe they are not singular, but the determinant is under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may, maybe serializing/deserializing very small floats is somehow broken and they get passed along smaller and smaller from one generation to the next?

Sorry, maybe I spoke too soon. For Hadoop, it looks like the first column is close to 0 for both X and Y, so...
12-05-2014
10:12 AM
@srowen wrote: I'm also wondering if somehow it really does have rank less than about 6. That's possible for large data sets if they were algorithmically generated.

Say that were true; then I should get the insufficient-rank error at least once when running the in-memory computation too, shouldn't I? Given that I've been chasing this for a few days now, I've run the in-memory computation over and over again and it has not happened. If the X and Y folders do contain the factors as they really are, then the factor matrices from Hadoop, for some reason, contain way more very small numbers than the in-memory ones (25000 as opposed to 76, on the 11_quarter.csv dataset), so maybe they are not singular, but the determinant is under SINGULARITY_THRESHOLD simply because the numbers are smaller? If I may, maybe serializing/deserializing very small floats is somehow broken and they get passed along smaller and smaller from one generation to the next?
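To illustrate the suspicion about small values versus genuine rank deficiency: the determinant of a k-by-k matrix shrinks with the k-th power of any uniform scaling of its entries, so uniformly tiny factors can drop below a fixed determinant threshold without being anywhere near singular. A minimal sketch using Commons Math; the threshold and matrices are made up, this is not Oryx's actual check:

    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.LUDecomposition;
    import org.apache.commons.math3.linear.RealMatrix;

    // Sketch: the same well-conditioned matrix, scaled down uniformly, can have its
    // determinant fall under a fixed "singularity" threshold although its rank is unchanged.
    public class SingularityThresholdDemo {
      public static void main(String[] args) {
        double threshold = 1.0e-9;   // placeholder threshold, not the real SINGULARITY_THRESHOLD
        RealMatrix m = new Array2DRowRealMatrix(new double[][] {
            {2.0, 0.1, 0.0},
            {0.1, 1.5, 0.2},
            {0.0, 0.2, 1.0}
        });
        RealMatrix tiny = m.scalarMultiply(1.0e-4);   // same rank, entries 10^4 smaller

        double det = new LUDecomposition(m).getDeterminant();
        double detTiny = new LUDecomposition(tiny).getDeterminant();   // det scales by (1e-4)^3

        System.out.println("det(m)    = " + det
            + " -> flagged singular? " + (Math.abs(det) < threshold));
        System.out.println("det(tiny) = " + detTiny
            + " -> flagged singular? " + (Math.abs(detTiny) < threshold));
      }
    }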
12-05-2014
09:44 AM
I have uploaded to Google Drive the one which traces all user clicks, including the one-time visitors, in case it's of any help.
12-05-2014
09:39 AM
It is not; the latest, smallest dataset is a six-month trace of user clicks through the system (well, a quarter of the searches, just to have a smaller and faster test). It does contain just the users with at least 11 different product searches, though, because when adding users with fewer clicks the datasets get a lot bigger and I'm hitting the JVM exception about too much time being spent garbage collecting. I have a bunch of others that I have tried, without this constraint on the minimum number of searched items, e.g. a one-month trace of all users and all their clicks; that one again does not work with Hadoop (but in memory it had a MAP of ~0.12).
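(Side note on that GC error: "GC overhead limit exceeded" can usually be pushed back by giving the JVM more heap, or switched off with a standard HotSpot flag; the jar name below is just a placeholder.)

    # more heap for the local/in-memory computation, and/or disable the GC overhead check
    java -Xmx8g -XX:-UseGCOverheadLimit -jar <computation-layer jar> ...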
12-05-2014
07:46 AM
Hi, thank you so much for looking into this! If it's of any help with the debugging, I managed to reproduce this with an even smaller subset of data, and I have uploaded the file to the Drive. I can also confirm that X and Y, before getting deleted, are indeed made up of much smaller numbers when run with Hadoop.
12-04-2014
06:09 AM
Hi Sean,

If I may summarize my findings so far, hopefully they will help pinpoint the problem (I tried a few other things since yesterday). So far I have ruled out the following:
- Faulty Hadoop and Snappy installation on my machine: the same behavior happens on the Cloudera QuickStart VM, and the AudioScrobbler computation with Hadoop works fine on both systems (by that I mean my machine and the VM, which I'll call CVM).
- Mismatch between the Hadoop client version embedded in Oryx and the installed version of Hadoop: I have built Oryx against the corresponding versions and the issue is still present (my machine has 2.5.1, the CVM has 2.5.0).
- Data being on the border of insufficient rank: the in-memory computation always produces X and Y with sufficient rank, the Hadoop computation always produces the opposite. Given how many trials I ran, I'd expect the situation to be reversed at least once.
- Faulty build of Oryx: I ran some computations using the 1.0.0 jar from the Releases page on GitHub, but again no improvement.
- Reducer memory issues: I tried a few runs with computation-layer.worker-high-memory-factor=5, same thing (see the config sketch below).
- test-set-fraction issues that might come up just with Hadoop: I see the same faulty behavior when I don't set a test set fraction.
- Data size issues: I ran some tests with an even more reduced version of my dataset, a bit smaller than the AudioScrobbler one. No improvement, I'm afraid.
Based on this, I can only conclude there is something faulty with the file itself and how the data is structured (as AudioScrobbler works fine). What do you think? Any hints on what to do next? Thank you.
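For completeness, the two settings referenced above would sit in oryx.conf roughly like this; only a sketch of the relevant fragment, with a placeholder value for the test set fraction and everything else omitted:

    # fragment of oryx.conf (Typesafe Config / HOCON syntax)
    model {
      # fraction of data held out for the MAP estimate; 0.1 is a placeholder value,
      # and the behavior is the same whether or not this is set
      test-set-fraction = 0.1
    }
    computation-layer {
      # more memory for the workers, per the "reducer memory issues" item above
      worker-high-memory-factor = 5
    }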