Created on 02-06-2014 10:17 AM - edited 09-16-2022 08:39 AM
A little info on the system this is running on:
I'm running CDH5 Beta 1 on RHEL 6U5, installed via parcels. $JAVA_HOME is set to the Cloudera-installed 1.7_25 JDK. Oryx was downloaded from GitHub and built from source using the hadoop22 profile. The data source for the ALS job is on HDFS, not local.
I have a dataset containing 3,766,950 observations in User,Product,Strength format, which I am trying to use with the Oryx ALS collaborative filtering algorithm. Roughly 67.37% of the observations have a weight of 1. My problem is that when I run the ALS job, it reports that X or Y does not have sufficient rank and deletes the model and its results.
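For illustration, the input is just a CSV of user,item,strength triples. These rows are made up, but they match the format of my data:

user1001,PROD-0417,1
user1001,PROD-2290,3
user1002,PROD-0417,1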
I've attempted running the Myrrix ParameterOptimizer using the following command (3 steps, 50% sample):
java -Xmx4g -cp myrrix-serving-1.0.1.jar net.myrrix.online.eval.ParameterOptimizer data 3 .5 model.features=10:150 model.als.lambda=0.0001:1
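For clarity, here is how I understand the arguments I passed:

data -> input data directory
3 -> number of optimization steps
.5 -> sample 50% of the data
model.features=10:150 -> range to search for the feature count
model.als.lambda=0.0001:1 -> range to search for lambda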
It recommended using {model.als.lambda=1, model.features=45}, which I then used in the configuration file.
The configuration file itself is very simple:
model=${als-model}
model.instance-dir=/Oryx/data
model.local-computation=false
model.local-data=false
model.features=45
model.lambda=1
serving-layer.api.port=8093
computation-layer.api.port=8094
And the computation command:
java -Dconfig.file=als.conf -jar computation/target/oryx-computation-0.4.0-SNAPSHOT.jar
After 20 minutes or so of processing, these are the final few lines of output:
Thu Feb 06 12:49:08 EST 2014 INFO Loading X and Y to test whether they have sufficient rank
Thu Feb 06 12:49:24 EST 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Thu Feb 06 12:49:24 EST 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
Thu Feb 06 12:49:24 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/X
Thu Feb 06 12:49:24 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/Y
Thu Feb 06 12:49:24 EST 2014 INFO Signaling completion of generation 0
Thu Feb 06 12:49:24 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/tmp
Thu Feb 06 12:49:24 EST 2014 INFO Dumping some stats on generation 0
Thu Feb 06 12:49:24 EST 2014 INFO Generation 0 complete
Any ideas why this isn't working with the recommended feature count and lambda? The ALS audioscrobbler example works fine, and the data format is similar (though the strengths in my dataset are considerably smaller).
Thanks in advance,
James
Created 02-08-2014 11:14 AM
I'll try the local build on one of the datanodes; that shouldn't be a problem for what I'm testing.
It's the full dataset, but the original data was actually userID/prodDesc/weight. Our security team told me I could send the data if I changed the prodDesc to a prodID, since it's pretty meaningless without the lookup tables. So the item variable went from a string when I was testing it to a numeric ID; perhaps that's why I didn't see the same error.
So I'm wondering if the problem only appears when the item variable is a string. An easy way to test that would be to hash the prodID, which gives an alphanumeric string similar in format to the original prodDesc.
I can hash the data and re-upload it, or you can run this little bit of python:
#!/usr/bin/python
# Hash the product ID (column 2) into an uppercase SHA-1 hex string,
# similar in format to the original prodDesc.
import csv
import hashlib
import sys

INPUT_FILE = csv.reader(open("cloudera_data.csv", "rb"), delimiter=",")
OUTPUT_FILE = csv.writer(open("output.csv", "wb"), delimiter=",")

for data_lines in INPUT_FILE:
    # Replace the product ID with the hash of its original value.
    data_lines[1] = hashlib.sha1(str(data_lines[1]).strip("\n")).hexdigest().upper()
    OUTPUT_FILE.writerow(data_lines)

sys.exit(0)
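Drop it in the same directory as cloudera_data.csv and run it; the hashed copy is written to output.csv.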
Created 02-08-2014 01:18 PM
Yes, that explains why you didn't see the same initial problem. Well, good that it was fixed anyhow.
Text vs. numeric IDs shouldn't matter at all; underneath, both are hashed. It looks like the amount of data and its nature are the same if it's just the IDs that were hashed, and I can't imagine collisions are an issue.
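To illustrate the idea, here is a minimal Python sketch of the concept only; it is not Oryx's actual hashing code, and the function name and hash choice are made up for illustration:

import hashlib
import struct

def to_internal_id(raw_id):
    # Illustrative only: map any ID (string or number) to a 64-bit integer
    # by hashing its string form. Oryx's real hashing may differ.
    digest = hashlib.md5(str(raw_id).encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]

# A text ID and a numeric ID go through the same path:
print(to_internal_id("PROD-0417"))
print(to_internal_id(12345))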
I tried converting these IDs 1-to-1 to alphanumeric IDs, and it worked for me.
You are using CDH 4.x vs. 5, right? That could be a difference, but I still wouldn't quite expect a problem of this form.
Anything else of interest in the logs? You're welcome to send me all of it.
You're starting from scratch when you run the test?
Created 02-08-2014 02:59 PM
I'm using CDH5 Beta 1, with Oryx compiled against the hadoop22 profile. Speaking of which, you may want to update the build documentation on GitHub, which says to use the profile name "cdh5"; the pom.xml actually uses hadoop22 as the profile name.
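For anyone else building against CDH5, the build command I'm using is along these lines (the exact goals may differ in your setup):

mvn -DskipTests -Phadoop22 clean install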
I'll try running the test again tonight and see how it works out. If I see anything else, I'll send you the log output, but I'm hoping for the best!
And yes, every test is started from scratch, just in case!
Created 02-08-2014 03:08 PM
Created 02-09-2014 07:39 PM
I'm able to replicate this issue as well.
I've run through various combinations of lambda/feature pairs. No luck.
I'm running the latest CDH4 binaries.
Sean, would you like my data set?
Created 02-10-2014 03:29 AM
Created 02-10-2014 08:01 AM
Hi Sean,
Good news: I recompiled and gave it a whirl with 10 features and 0.0001 lambda as a first pass. Nothing abnormal or unusual jumps out at me in the output, so I believe the commits you made did the trick.
Generation 0 was successfully built, and it passed the X/Y sufficient rank test. At first glance, the recommendations seem valid, if slightly skewed toward the most popular items (which is expected). I'll obviously need to work on a rescorer to de-emphasize the over-represented items. Next on the list is seeing whether the model can be fit better, so I need to come up with an automated test.
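For reference, the only lines I changed in the config file from my first post were the feature count and lambda:

model.features=10
model.lambda=0.0001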
Have you built the Optimizer into the Oryx source code, by chance, or is it just in Myrrix?
Created 02-10-2014 08:50 AM
Created 12-07-2014 04:08 PM
One late reply here: this bug fix may be relevant to the original problem:
https://github.com/cloudera/oryx/issues/99
I'll put this out soon in 1.0.1