Member since: 02-06-2014 | Posts: 19 | Kudos Received: 0 | Solutions: 0
02-11-2014
01:50 PM
Thanks for the input ... I think I misread the title of this forum; I thought it was for installations of CDH via CM, sorry! I was afraid that would be the answer. I literally just finished building a POC cluster of CDH5B1 and loading 30TB of test data last Friday, and to be honest I didn't really want to rebuild the cluster for CDH5B2 if it wasn't required.
02-11-2014
01:11 PM
I've reviewed the upgrade documentation for CDH5B2, but unfortunately it seems to cover only package installations. I'm assuming that parcel upgrades are effectively not possible, since everything has to be offline to perform the upgrade? (By the way, you need to fix your label requirements for posting to this forum; I believe CDH5B2 uses CM 5.0B2.) Thanks, James
Labels: Cloudera Manager
02-10-2014
08:01 AM
Hi Sean, Good news: I recompiled and gave it a whirl, using 10 features and a lambda of 0.0001 as a first pass. Nothing abnormal or unusual jumps out at me in the output, so I believe the commits you made did the trick. Generation 0 was successfully built, and it has passed the X/Y sufficient rank test. At first glance, the recommendations seem valid, if slightly skewed toward the most popular items (which is expected). I obviously need to work on a rescorer to minimize the over-represented items. Next on the list is to see if the model can be fit better, so I need to come up with an automated unit test. Have you built the Optimizer into the Oryx source code, by chance, or is it just in Myrrix?
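For anyone following along, here is a minimal sketch of the kind of popularity-based rescoring mentioned above. It is not the Oryx/Myrrix Rescorer API; the function name, the item_user_counts input, and the 10% popularity cap are illustrative assumptions only.

# Hypothetical sketch: downweight items recommended for a very large share of users,
# so the most popular items do not dominate every recommendation list.
def rescore(recommendations, item_user_counts, num_users, popularity_cap=0.10):
    # recommendations: list of (item_id, score) pairs from the recommender
    # item_user_counts: dict mapping item_id -> number of distinct users who have it
    rescored = []
    for item_id, score in recommendations:
        share = item_user_counts.get(item_id, 0) / float(num_users)
        if share > popularity_cap:
            # Scale the score down in proportion to how far the item exceeds the cap.
            score *= popularity_cap / share
        rescored.append((item_id, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)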
02-08-2014
02:59 PM
I'm using CDH5 Beta 1, with Oryx compiled against the hadoop22 profile. Speaking of which, you may want to update the Build documentation on GitHub: it says to use the profile name "cdh5", but the pom.xml actually uses "hadoop22" as the profile name. I'll try running the test again tonight and see how it works out. If I see anything else, I'll send you the log output, but I'm hoping for the best! And yes, every test is started from scratch, just in case!
02-08-2014
11:14 AM
I'll try the local build on one of the datanodes; that shouldn't be a problem for what I'm testing. It's the full dataset, but the original data was actually userID/prodDesc/weight. I was informed by our security team that I could send the data if I changed the prodDesc to prodID, since it's pretty meaningless without lookup tables. So the Item variable went from a string when I was testing it to a numeric; perhaps that's why I didn't see the same error. So I'm wondering if the problem only appears when the Item variable is a string. An easy way to test that would be to hash the prodID, which would give an alphanumeric string similar in format to the original prodDesc. I can hash the data and re-upload it, or you can run this little bit of Python:

#!/usr/bin/python
import csv
import hashlib

# Read the original data, SHA-1 hash the second column (the prodID),
# and write the result back out as CSV in the same layout.
INPUT_FILE = csv.reader(open("cloudera_data.csv", "rb"), delimiter=",")
OUTPUT_FILE = csv.writer(open("output.csv", "wb"), delimiter=",")
for data_line in INPUT_FILE:
    data_line[1] = hashlib.sha1(str(data_line[1]).strip("\n")).hexdigest().upper()
    OUTPUT_FILE.writerow(data_line)
02-07-2014
11:18 AM
I'm cleared to send the dataset, just need to know where it's going! James
02-07-2014
05:08 AM
Some of the items are exceptionally popular, while a large number of the other items have very low values. The weight is a simple count of the items per user within a timeframe, so a userID/itemID combo should only be seen once, but some of those items are seen for a very large percentage of the userIDs. I've tried setting Features to 5 and Lambda to 0.01, which also failed. I'll try Features of 3 and Lambda of 0.0001 and see if that has any effect. I'll verify with our Legal dept about sending the data over, but it shouldn't be an issue. I know I have your card from when we met at the London and NY Strata conferences, but it's in my desk at work and I'm working from home, so you might have to message me with your email address or data drop location.
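For context, here is a minimal sketch of how weights like those described above can be produced, counting occurrences of each userID/itemID pair within the window. The input file name and column order are illustrative assumptions, not the actual pipeline.

#!/usr/bin/python
import csv
from collections import defaultdict

# Hypothetical sketch: count how many times each (userID, itemID) pair occurs,
# then emit one userID,itemID,weight row per pair.
counts = defaultdict(int)
for row in csv.reader(open("events.csv", "rb"), delimiter=","):
    user_id, item_id = row[0], row[1]
    counts[(user_id, item_id)] += 1

writer = csv.writer(open("weights.csv", "wb"), delimiter=",")
for (user_id, item_id), weight in counts.items():
    writer.writerow([user_id, item_id, weight])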
02-06-2014
01:50 PM
The model is indeed being built from the full dataset, while the optimization was performed against a 50% sample. To get the sample, I downloaded the dataset from HDFS to the local filesystem and took the first 1,883,475 lines with "head -n 1883475", writing them to data50percent.csv. Then I ran the optimizer locally, not distributed. Should I use the full dataset instead?
Dataset size: 125 MB
Number of records: 3,766,950
Unique users: 608,146
Unique items: 1,151
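As a rough illustration of that split (1,883,475 is exactly half of the 3,766,950 records quoted above), here is a small sketch equivalent to the head command; the file names are illustrative assumptions only.

#!/usr/bin/python
# Hypothetical sketch of the 50% prefix sample described above.
SAMPLE_LINES = 3766950 // 2   # 1,883,475 lines, half of the full dataset

with open("full_dataset.csv", "rb") as src, open("data50percent.csv", "wb") as dst:
    for line_number, line in enumerate(src):
        if line_number >= SAMPLE_LINES:
            break
        dst.write(line)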
02-06-2014
11:33 AM
Hi Sean, I should have mentioned that I've tried a few variations, each resulting in the same error. I've tried the following Features : Lambda combinations so far, each with the same result as when I followed the recommended settings:
20 : 0.065
100 : 0.065
45 : 1
45 : 0.1
50 : 0.1
All of those combinations end with the following error:
Thu Feb 06 14:20:37 EST 2014 INFO Loading X and Y to test whether they have sufficient rank
Thu Feb 06 14:20:50 EST 2014 INFO Matrix is not yet proved to be non-singular, continuing to load...
Thu Feb 06 14:20:50 EST 2014 WARNING X or Y does not have sufficient rank; deleting this model and its results
Thu Feb 06 14:20:50 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/X
Thu Feb 06 14:20:50 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/Y
Thu Feb 06 14:20:50 EST 2014 INFO Signaling completion of generation 0
Thu Feb 06 14:20:50 EST 2014 INFO Deleting recursively: hdfs://nameservice1/Oryx/data/00000/tmp
Thu Feb 06 14:20:50 EST 2014 INFO Dumping some stats on generation 0
Thu Feb 06 14:20:50 EST 2014 INFO Generation 0 complete