I tried setting up Oryx on the CDH QuickStart VM and ran a sample collaborative filtering example with the ALS algorithm in local mode, which worked fine.
But when I try with actual data (about 300 MB) in local mode, I do not see any progress (X/Y folder creation, etc.). Only stats.json, computation.conf, and an empty (0 KB) _SUCCESS file get created, nothing else. How can I confirm the computation is running? Is there any other location where logs are generated?
[root@localhost 00000]# ls -lrt
drwxrwxr-x 2 cloudera cloudera 4096 Mar 24 08:26 inbound
-rwxr-xr-x 1 cloudera cloudera 5252 Mar 24 08:35 computation.conf
-rwxr-xr-x 1 cloudera cloudera 0 Mar 24 08:36 _SUCCESS
-rwxr-xr-x 1 cloudera cloudera 50 Mar 24 08:36 stats.json
What is in inbound? file names, contents? One possibility is that the file isn't named in a recognized way.
What's in the log output from the time when it should build the model?
I think I had the file in .dat format; I changed it to .csv and this time it worked. Is it mandatory for the file to always be in CSV?
Also, I had 13,292,825 distinct users and 13,558,965 distinct items in the file, but the log prints as below:
INFO: All model elements loaded, 5167 users and 157950 items
The recommendations are the same for most of the random users I have tried, and they don't match the Mahout recommendations I got. Am I missing something?
Yeah, it's expecting .csv, .zip or .gz files. You can change the delimiter that it expects in the config file. I'm questioning whether it really makes sense to filter on file name here at all.
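For reference, the relevant settings live in the Oryx config file (HOCON format). A rough sketch follows; the `instance-dir`, `local-computation`, and `local-data` keys are standard for Oryx 1.x ALS, but the exact key name for the delimiter varies by version and is shown here only as an assumption, so check the default config shipped with your build:

```hocon
# Hedged sketch of an Oryx 1.x ALS config; paths are illustrative.
model = {
  type = "als"
  instance-dir = "/user/cloudera/Trade"  # root directory ABOVE 00000/
  local-computation = true
  local-data = true
}
# Assumed key name for the field delimiter; verify against the
# default config.conf bundled with your Oryx version.
inbound = {
  delim = ","
}
```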
Are you sure the data is in CSV? Maybe there is still some mismatch between what's being read and what you put in.
Start from an empty directory and create "00000/inbound" under it. Put your data files there, and make sure your config file points to the root directory above 00000. That's a good way to make sure it's all starting from a clean slate.
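Spelled out as commands, the clean-slate layout looks like this (the root path and file contents are illustrative placeholders, not your real data):

```shell
# Create a fresh instance directory with the layout Oryx expects.
# "/tmp/oryx-demo" stands in for your real root; your config's
# instance-dir setting should point at this root, not at 00000 itself.
ROOT=/tmp/oryx-demo
mkdir -p "$ROOT/00000/inbound"

# Drop your input CSV into inbound/. Here a tiny fake sample of
# user,item,strength triples stands in for the real file.
printf 'u1,i1,1.0\nu2,i2,1.0\n' > "$ROOT/00000/inbound/sample.csv"

ls -R "$ROOT"
```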
Maybe you can show some of your data lines to double check that?
Thanks for your prompt response.
[cloudera@localhost oryx]$ head -10 Trade/00000/inbound/GroomedTradeDataForMahout.csv
That looks fine to me. And have you verified that the file contains the number of distinct values you expect? I usually do something like (substitute your real file name):

cut -d, -f1 yourfile.csv | sort -u | wc -l
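End to end, with a throwaway sample file (since your real data isn't shown here), the distinct-count check looks like:

```shell
# Build a small sample of user,item,strength lines (illustrative only).
cat > /tmp/sample.csv <<'EOF'
u1,i1,1.0
u1,i2,2.0
u2,i1,1.0
u3,i3,5.0
EOF

# Distinct users: unique values in the first comma-separated field.
cut -d, -f1 /tmp/sample.csv | sort -u | wc -l   # 3 (u1, u2, u3)

# Distinct items: unique values in the second field.
cut -d, -f2 /tmp/sample.csv | sort -u | wc -l   # 3 (i1, i2, i3)
```

If these counts on your real file differ wildly from what the serving layer reports, the mismatch is likely in parsing (wrong delimiter or unexpected file format) rather than in the model itself.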
What happens with a fresh run, just to rule out accidentally using old data/config?
If it's still the same number, could I have a copy of your data file (offline is fine -- sowen at cloudera)