Oryx ALS: X and Y do not have sufficient rank

Re: Oryx ALS: X and Y do not have sufficient rank

Explorer

I'll try the local build on one of the datanodes, that shouldn't be a problem for what I'm testing.

 

It's the full dataset, but the original data was actually userID/prodDesc/weight ... I was informed by our security team that I could send the data if I changed the prodDesc to prodID, since it's pretty meaningless without lookup tables.  So the Item variable went from a string when I was testing it, to a numeric; perhaps that's why I didn't see the same error.

 

So I'm wondering if the problem is only seen when the Item variable is a string ... an easy way to test it would be to hash the prodID, which would give an alphanumeric string, similar in format to the original prodDesc.

I can hash the data and re-upload it, or you can run this little bit of Python:

 

#!/usr/bin/env python3
# Replace the item column (index 1) with an upper-case SHA-1 hex digest.

import csv
import hashlib

with open("cloudera_data.csv", newline="") as src, \
        open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter=",")
    for data_line in csv.reader(src, delimiter=","):
        # Hash the prodID so it resembles the original alphanumeric prodDesc.
        item = data_line[1].strip("\n").encode("utf-8")
        data_line[1] = hashlib.sha1(item).hexdigest().upper()
        writer.writerow(data_line)

 

Re: Oryx ALS: X and Y do not have sufficient rank

Master Collaborator

Yes that explains why you didn't see the same initial problem. Well, good that was fixed anyhow.

 

Text vs numeric shouldn't matter at all. Underneath, they are both hashed. It looks like the amount of data and its nature are the same if the only change is that the IDs were hashed. I can't imagine collisions are an issue.
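
To illustrate why the ID type should not matter, here is a minimal sketch of mapping any ID to a 64-bit key. The function name id_to_long and the MD5-based scheme are assumptions made for illustration; this is not necessarily the hash Oryx uses internally.

```python
import hashlib

def id_to_long(raw_id):
    """Map any ID (numeric or text) to a signed 64-bit integer.

    Assumption: an illustrative hash, not necessarily what Oryx does.
    """
    digest = hashlib.md5(str(raw_id).encode("utf-8")).digest()
    # Keep the first 8 bytes of the digest as a signed 64-bit value.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

# Numeric-looking IDs and their SHA-1 text equivalents take the same path.
numeric_ids = ["1001", "1002", "1003"]
text_ids = [hashlib.sha1(i.encode("utf-8")).hexdigest().upper()
            for i in numeric_ids]

numeric_keys = [id_to_long(i) for i in numeric_ids]
text_keys = [id_to_long(i) for i in text_ids]
```

Either way, the model only ever sees one distinct key per distinct item, so a 1-1 relabeling of the IDs leaves the structure of the data unchanged.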

 

I tried converting these 1-1 to an ID that is alphanumeric, and it worked for me.

 

You are using CDH 4.x rather than 5, right? That could be a difference, but I still wouldn't quite expect a problem of this form.

Anything else of interest in the logs? You're welcome to send me all of it.

 

Are you starting from scratch each time you run the test?

Re: Oryx ALS: X and Y do not have sufficient rank

Explorer

I'm using CDH5 Beta 1, with Oryx compiled against the hadoop22 profile.  Speaking of which, you may want to update the build documentation on GitHub, which says to use the profile name "cdh5", while the pom.xml actually uses hadoop22 as the profile name.

 

I'll try running the test again tonight and see how it works out.  If I see anything else, I'll send you the log output, but I'm hoping for the best!

 

And yes, every test is started from scratch, just in case!

Re: Oryx ALS: X and Y do not have sufficient rank

Master Collaborator
Oops, fixed. Yes, I'm using CDH5b1 too, so that's not a difference. Can you compile from HEAD to make sure we're synced up there? You may already be, just checking. I can make a binary too. Any logs would be of interest for sure.

I would suggest trying again with clearly small values for features (like 10) and clearly small values for lambda (like 0.0001) to see if that at least works. I would expect a lower number of features to be appropriate given that there is a smallish number of items.

You might try the optimizer again with lower ranges for both. More features encourage overfitting and a larger lambda encourages underfitting, so they tend to counteract each other. You may find a better value when both are low.
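
As a rough illustration of why fewer features can help when there are few items: a factor matrix with one row per item and one column per feature cannot have rank greater than the smaller of the two dimensions. The sketch below is a generic rank check with a tolerance, written in plain Python. The function names and the default tolerance are assumptions for illustration, and Oryx's actual check may work differently.

```python
def numerical_rank(matrix, tol=1e-10):
    """Rank of a list-of-lists matrix via Gaussian elimination
    with partial pivoting. Pivots with absolute value <= tol
    are treated as zero."""
    rows = [[float(v) for v in r] for r in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Use the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # column is numerically dependent on earlier ones
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank

def has_sufficient_rank(factors, features, tol=1e-10):
    """True if the factor matrix has full rank for `features` columns."""
    return numerical_rank(factors, tol) == features

# 3 items x 5 features: rank can be at most 3, so the check fails.
y_too_wide = [[1, 0, 0, 2, 1],
              [0, 1, 0, 1, 3],
              [0, 0, 1, 4, 2]]
# 4 items x 2 features: full column rank, so the check passes.
y_ok = [[1, 0], [0, 1], [1, 1], [2, 1]]
# A nearly singular matrix passes or fails depending on the tolerance.
y_nearly_singular = [[1, 0], [0, 1e-6]]
```

With these toy matrices, has_sufficient_rank(y_too_wide, 5) is False no matter how lambda is set, while y_nearly_singular passes with the default tolerance and fails with tol=1e-3.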

Re: Oryx ALS: X and Y do not have sufficient rank

New Contributor

I'm able to replicate this issue as well. 

 

I've run through various combinations of lambda/feature pairs. No luck. 

 

I'm running the latest CDH4 binaries. 

 

Sean, would you like my data set?

 

Re: Oryx ALS: X and Y do not have sufficient rank

Master Collaborator
It's "normal" for this result to happen if the parameters are way out of kilter for the data set. I suppose it tends to be easier for that to happen with small data. So whether it's reproducing a problem depends on the data. But if you think the params are quite reasonable for the data and you see this, yes please send it to me.

Re: Oryx ALS: X and Y do not have sufficient rank

Explorer

Hi Sean,

 

Good news:  I recompiled and gave it a whirl giving it 10 Features and .0001 Lambda as a first pass.  Nothing abnormal or unusual jumps out at me in the output, so I believe the commits you made did the trick.

 

Generation 0 was successfully built, and it has passed the X/Y sufficient rank test.  At first glance, the recommendations seem valid, if slightly skewed for the most popular items (which is expected).  I obviously need to work on a rescorer to minimize the over-represented items.  Next thing on the list is to see if it can be fit better, so I obviously need to come up with an automated unit test.
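
For the rescoring idea, one possible starting point is to divide each score by a penalty that grows with item popularity. This is only a sketch: the function name, the log-based penalty, and the item_counts input are all hypothetical, it is not Oryx's rescorer interface, and it assumes scores are positive.

```python
import math

def rescore_by_popularity(recommendations, item_counts, strength=1.0):
    """Down-weight popular items (hypothetical helper, not an Oryx API).

    recommendations: list of (item_id, score) pairs, scores assumed > 0
    item_counts: dict of item_id -> number of interactions
    strength: 0 leaves scores unchanged; larger values penalize more
    """
    rescored = []
    for item_id, score in recommendations:
        count = item_counts.get(item_id, 0)
        penalty = (1.0 + math.log1p(count)) ** strength
        rescored.append((item_id, score / penalty))
    # Highest rescored value first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy data: "A" is very popular, "B" is not.
recs = [("A", 0.9), ("B", 0.8)]
counts = {"A": 10000, "B": 10}
reranked = rescore_by_popularity(recs, counts)
unchanged = rescore_by_popularity(recs, counts, strength=0)
```

The strength parameter would need tuning against held-out data, which ties in with the automated test mentioned above.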

Have you built the Optimizer into the Oryx source code, by chance, or is it just in Myrrix?

Re: Oryx ALS: X and Y do not have sufficient rank

Master Collaborator
That's good, although I am still not sure why it worked fine for me with quite different params. The transformation should not have done much. It could be that the singularity tolerance is too strict, but I doubt it.

There's going to be a fairly big rewrite of the computation, to use Spark in some parts for example. As part of that I am going to build evaluation into the pipeline itself, so that it's always tuning as it goes. It's not going to come out soon (it's just in the design phase), but the idea is that this should not be something anyone has to do by hand.

For practical purposes, I would just proceed with these params for now and return to the idea of optimization later. I am guessing (?) your real data set is different anyway and would require different params. Or for this data set you could use the local build.

Re: Oryx ALS: X and Y do not have sufficient rank

Master Collaborator

One late reply here: this bug fix may be relevant to the original problem:

 

https://github.com/cloudera/oryx/issues/99

 

I'll put this out soon in 1.0.1.