Created on 02-06-2014 10:17 AM - edited 09-16-2022 08:39 AM
A little info on the system this is running on:
I'm running CDH 5 Beta 1 on RHEL 6.5, installed via parcels. I've set $JAVA_HOME to the Cloudera-installed JDK 1.7.0_25. Oryx was downloaded from GitHub and built from source using the hadoop22 profile. The data source for the ALS job is on HDFS, not local.
I have a dataset containing 3,766,950 observations in User,Product,Strength format, which I am trying to use with the Oryx ALS collaborative filtering algorithm. Roughly 67.37% of the observations have a weight of 1. My problem is that whenever I run the ALS job, it reports that X or Y does not have sufficient rank, and the model is deleted.
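For reference, the input looks like this (made-up rows, just to show the User,Product,Strength format; the strengths are counts, so mostly 1s):

u10394,i274,3
u10394,i551,1
u88211,i274,1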
I've attempted running the Myrrix ParameterOptimizer using the following command (3 steps, 50% sample):
java -Xmx4g -cp myrrix-serving-1.0.1.jar net.myrrix.online.eval.ParameterOptimizer data 3 .5 model.features=10:150 model.als.lambda=0.0001:1
It recommended using {model.als.lambda=1, model.features=45}, which I then used in the configuration file.
The configuration file itself is very simple:
model=${als-model}
model.instance-dir=/Oryx/data
model.local-computation=false
model.local-data=false
model.features=45
model.lambda=1
serving-layer.api.port=8093
computation-layer.api.port=8094
And the computation command:
java -Dconfig.file=als.conf -jar computation/target/oryx-computation-0.4.0-SNAPSHOT.jar
After 20m or so of processing, these are the final few lines of output:
Thu Feb 06 12:49:08 EST 2014 INFO Loading X and Y to test whether they have sufficient rank
Thu Feb 06 12:49:24 EST 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Thu Feb 06 12:49:24 EST 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
Thu Feb 06 12:49:24 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/X
Thu Feb 06 12:49:24 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/Y
Thu Feb 06 12:49:24 EST 2014 INFO Signaling completion of generation 0
Thu Feb 06 12:49:24 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/tmp
Thu Feb 06 12:49:24 EST 2014 INFO Dumping some stats on generation 0
Thu Feb 06 12:49:24 EST 2014 INFO Generation 0 complete
Any ideas why this isn't working even with the recommended feature count and lambda? The ALS audioscrobbler example works fine, and the data format is similar (though the strengths are considerably smaller in my dataset).
Thanks in advance,
James
Created 02-06-2014 10:34 AM
Hmm, that does sound strange. lambda = 1 is on the high side, although it may have come out as the best value given the range tested in the optimizer and the variation in random starting points, etc. There is some randomness on the other side too, when it builds and tests the factorization.
My first guess is: decrease lambda. You might re-run the optimizer and restrict it to at most 0.1. This isn't a great answer, but I think it may be the fastest path to something working.
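For intuition about why lambda matters here, ALS minimizes a regularized objective of roughly this form (the standard formulation, sketched for illustration; the exact weighting in the implementation may differ):

\min_{X,Y} \sum_{(u,i)} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)

The \lambda term penalizes the size of the user and item factor vectors, so a large \lambda shrinks the rows of X and Y toward zero, which is exactly the sort of thing that can leave them numerically rank-deficient.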
Longer-term, this is going to be rewritten to fully integrate parameter search into model building, so it won't be a separate process that can disagree with the build.
Created 02-06-2014 11:33 AM
Hi Sean,
I should have mentioned that I've tried a few variations, each resulting in the same error. I've tried the following combinations so far, each giving the same result as the recommended feature/lambda settings:
Features : Lambda
20 : 0.065
100 : 0.065
45 : 1
45 : 0.1
50 : 0.1
All of those combinations end with the following error:
Thu Feb 06 14:20:37 EST 2014 INFO Loading X and Y to test whether they have sufficient rank
Thu Feb 06 14:20:50 EST 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Thu Feb 06 14:20:50 EST 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
Thu Feb 06 14:20:50 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/X
Thu Feb 06 14:20:50 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/Y
Thu Feb 06 14:20:50 EST 2014 INFO Signaling completion of generation 0
Thu Feb 06 14:20:50 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/tmp
Thu Feb 06 14:20:50 EST 2014 INFO Dumping some stats on generation 0
Thu Feb 06 14:20:50 EST 2014 INFO Generation 0 complete
Created 02-06-2014 12:32 PM
Hmm. How much data are we talking about? And you're building the model on the same data you optimized on?
How many unique users and items?
The general remedy is fewer features and lower lambda, but it can't be right that the optimizer is fine with these values while the model build isn't fine with any of them. Something is not right here...
Created 02-06-2014 01:50 PM
The model is indeed being built from the full dataset, while the optimization was performed against a 50% sample. To get the sample, I downloaded the dataset from HDFS to the local filesystem and took the first 1,883,475 lines with head, writing them to data50percent.csv. Then I ran the optimizer locally, not distributed. Should I use the full dataset instead?
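Concretely, the sampling steps were roughly the following (the HDFS path is a placeholder; I'm citing the commands from memory):

hadoop fs -getmerge /Oryx/data/00000/inbound data.csv    # copy the input out of HDFS (placeholder path)
wc -l data.csv                                           # 3766950 lines in total
head -n 1883475 data.csv > data50percent.csv             # keep the first 50% of lines
# Note: head keeps the *first* half rather than a random sample;
# shuf -n 1883475 data.csv > data50percent.csv would draw a random one.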
Dataset size: 125 MB
Number of records: 3,766,950
Unique users: 608,146
Unique items: 1,151
Created 02-06-2014 03:32 PM
I would use 100% of the data, yes, but I don't think that should make a big difference.
The number of items is low, but not that low. Is there any reason to think the items are very 'redundant'?
This is strange enough that I think there may be a bug somewhere. Is this data you can share, in anonymized form, offline?
I am wondering whether the singularity threshold needs to be configurable.
I would still try turning down lambda / features to get it going, although I still don't see a reason why it should be necessary.
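To spell out what the rank check is looking at (as I understand it; a sketch, not the exact code): with model.features=45 the factors are

X \in \mathbb{R}^{608146 \times 45}, \qquad Y \in \mathbb{R}^{1151 \times 45}

and "sufficient rank" means both have full column rank 45, i.e. at least 45 linearly independent user vectors and 45 linearly independent item vectors. 1,151 items is far more than 45, so a failure suggests the item vectors come out numerically near-collinear, not that there are too few items.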
Created 02-07-2014 05:08 AM
Some of the items are exceptionally popular, while a large number of the others have very low counts. The weight is a simple count of the item per user within a timeframe, so a given userID/itemID combination should only appear once, but some items appear for a very large percentage of the userIDs.
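For clarity, the strengths were produced with something like the following (file names are placeholders; events.csv holds one user,item line per raw event in the timeframe):

# Count how many times each user,item pair occurs in the raw events
awk -F, '{count[$1 FS $2]++} END {for (k in count) print k FS count[k]}' events.csv > strengths.csv

That emits one user,item,count line per combination, which is why a given pair appears only once.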
I've tried setting features to 5 and lambda to 0.01, which also failed. I'll try features of 3 and lambda of 0.0001 and see if that has any effect.
I'll verify with our legal department about sending the data over, but it shouldn't be an issue. I have your card from when we met at Strata in London and NY, but it's in my desk at work and I'm working from home, so you might have to message me with your email address or a drop location for the data.
Created 02-07-2014 11:18 AM
I'm cleared to send the dataset, just need to know where it's going!
James
Created 02-07-2014 12:09 PM
Thanks James, I'm sowen at cloudera. It sounds like something I'll have to debug, as it's either quite subtle or a bug in the program. I'll solve it this weekend.
Created 02-08-2014 07:57 AM
Thanks James, I got the data.
FWIW, it built successfully locally, which is at least good. That is not a solution, but might get you moving.
I ran the data on Hadoop (CDH5) and uncovered a different problem, which I fixed: https://github.com/cloudera/oryx/commit/fee977f6a682ba6a2e8c2e48275cb4dc5718c8b2
It is basically an artifact of having virtually all even IDs. That shouldn't be a problem of course.
It ran successfully after that. Is the data you showed me the same as what you're running, or a sample (or redacted)? I just want to understand why you didn't see the error I did, and whether that offers any clue as to why it works for me.