Loss of users after training
Labels:
- Apache Hadoop
- Training
Created on 06-28-2015 11:17 PM - edited 09-16-2022 02:32 AM
Sean,
As I posted in other discussion threads, we are trying to run Oryx 1.0 with CDH 5.4.1.
One thing I noticed: when computing with Hadoop, we have 7.5 million users in the training set, but after training it generates an X matrix with only about 7.3 million users.
I checked the log and found no message or error related to this. I also tried the same training dataset with local computation (single VM), and it produces all 7.5 million users in X. I then examined the users lost during the Hadoop computation and noticed that all of their preference values are less than 0.01... It certainly makes sense to ignore associations with very low preference values, but I cannot find such a "config" or "control". Is there such a thing in Oryx? And why does it filter in the Hadoop computation but not in the single-VM computation?
Thanks.
Created 06-28-2015 11:42 PM
Yes, if model.decay.zeroThreshold is positive, then any value whose absolute value is smaller is pruned. This can mean entire users are removed if none of their prefs survive. Do you set this, or model.decay.factor? By default it's all off and nothing decays, though.
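To make the behavior concrete, here is a minimal sketch of the kind of pruning described above (my own illustration with hypothetical names, not Oryx source code): values below the threshold are dropped, and a user disappears from X entirely if none of their preferences survive.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of zero-threshold pruning; not Oryx source code.
public final class PruneSketch {

  // Drops any preference whose absolute value is below the threshold;
  // a user whose preferences are all pruned disappears entirely.
  static void pruneNearZero(Map<String, Map<String, Double>> prefsByUser,
                            double zeroThreshold) {
    Iterator<Map.Entry<String, Map<String, Double>>> users =
        prefsByUser.entrySet().iterator();
    while (users.hasNext()) {
      Map<String, Double> itemPrefs = users.next().getValue();
      itemPrefs.values().removeIf(v -> Math.abs(v) < zeroThreshold);
      if (itemPrefs.isEmpty()) {
        users.remove(); // this user will be missing from the X matrix
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Map<String, Double>> prefs = new HashMap<>();
    prefs.put("user-1", new HashMap<>(Map.of("item-a", 1.24, "item-b", 0.005)));
    prefs.put("user-2", new HashMap<>(Map.of("item-c", 0.007)));
    pruneNearZero(prefs, 0.01);
    System.out.println(prefs); // user-2 is gone; user-1 keeps only item-a
  }
}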
Created 06-28-2015 11:55 PM
I see.
Yes, our settings are:
model.decay.factor=1.0
model.decay.zeroThreshold=0.01
That explains it.
However, is it only taking effect when running with Hadoop? We use the same settings in local computation (single VM), but the threshold does not seem to be applied there.
Thanks.
Created 06-29-2015 12:15 AM
No, it should work the same in both cases. You should see a message like "Pruning near-zero entries". Are you at least seeing that? That would start to narrow it down.
Created 06-29-2015 12:29 AM
Hmm... I'm not seeing that in the Oryx log.
Is it in the Hadoop logs? If so, in which job step (MergeNewOldStep? RowStep?)?
Created 06-29-2015 12:51 AM
For the stand-alone version? There's no Hadoop involved; I mean the Oryx log, yes. I suppose my next question, then, is whether you're sure this config is actually being used in your stand-alone mode. You can see where it's applied in "ReadInputs".
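One quick way to check is to load the same config file you pass to Oryx and print the decay settings. This is a sketch, assuming the config is HOCON read via Typesafe Config (which Oryx 1 uses); the class name here is hypothetical, and the fallback to ConfigFactory.load() only picks up defaults if Oryx's reference config is on the classpath.

import java.io.File;

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Sketch: load the config file passed to Oryx and print the decay settings
// to confirm the running process would see the intended values.
public final class CheckDecayConfig {
  public static void main(String[] args) {
    // Assumes the config file path is the first argument.
    Config config = ConfigFactory.parseFile(new File(args[0]))
        .withFallback(ConfigFactory.load())
        .resolve();
    System.out.println("model.decay.factor        = "
        + config.getDouble("model.decay.factor"));
    System.out.println("model.decay.zeroThreshold = "
        + config.getDouble("model.decay.zeroThreshold"));
  }
}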
Created 06-29-2015 07:04 AM
This is what I found. It looks odd, but can you double-check it?
(1) For the Hadoop version, our settings:
model.decay.factor=1.0
model.decay.zeroThreshold=0.01
I do NOT see "Pruning near-zero entries" in the Oryx log.
However, judging from the results, the pruning IS actually being performed...
(2) For the stand-alone version (local computation with one VM), the same settings:
model.decay.factor=1.0
model.decay.zeroThreshold=0.01
I DO see "Pruning near-zero entries" in the Oryx log.
However, judging from the results, the pruning is actually NOT being performed...
(Note)
(a) In both cases I tested, this is generation 0; that is, there is no previous generation.
(b) Our training data looks like the following; note that it's not pre-aggregated by (user-id, item-id). (A small illustration of why this might matter follows the sample rows.)
user-1,item-a,1.24
user-1,item-a,0.002
user-1,item-b,0.005
user-2,item-c,0.007
user-3,item-c,0.006
user-3,item-d,2.5
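Regarding note (b): whether duplicate (user-id, item-id) rows are aggregated before or after the zeroThreshold is applied can change the outcome for rows that are individually tiny but sum above the threshold. A small illustration (my own, not Oryx code; I don't know which order Oryx actually uses, which is part of what's in question here):

import java.util.List;

// Illustration: two rows for the same (user, item), each below the
// 0.01 threshold, but summing above it.
public final class OrderMatters {
  public static void main(String[] args) {
    double threshold = 0.01;
    List<Double> rows = List.of(0.007, 0.006);

    // Aggregate first, then prune: 0.013 >= 0.01, so the pref survives.
    double sum = rows.stream().mapToDouble(Double::doubleValue).sum();
    System.out.println("aggregate-then-prune keeps it: " + (Math.abs(sum) >= threshold));

    // Prune first, then aggregate: both rows are dropped, and the pref
    // (possibly the whole user) is lost.
    double prunedSum = rows.stream()
        .filter(v -> Math.abs(v) >= threshold)
        .mapToDouble(Double::doubleValue)
        .sum();
    System.out.println("prune-then-aggregate keeps it: " + (prunedSum != 0.0));
  }
}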
Created 06-29-2015 07:55 AM
Got it, that's a bug. I fixed it and pushed to master:
Created 06-29-2015 08:01 AM
Cool.
I just read your changes, and it seems the fix only affects the local computation (not the Hadoop computation). Correct?
Yes, I know the Hadoop computation is already doing the right thing and doesn't need a fix.
