Support Questions


Loss of users after training

Explorer

Sean,

 

As I posted in other discussion threads, we are trying to run Oryx 1.0 with CDH 5.4.1.

One thing I noticed: when computing with Hadoop, we have 7.5 million users in the training set. However, after training, it generates an X matrix with only about 7.3 million users.

 

I checked the log and found no messages or errors related to this. I also tried the same training dataset in local computation (single VM), and it produces 7.5 million users in X just fine. I looked at the users that got lost during the Hadoop computation and noticed that all their preference values are less than 0.01... It definitely makes sense to ignore associations with very low preference values, but I cannot find such a "config" or "control"... Is there such a thing in Oryx? Why does it filter in the Hadoop computation but not in the single-VM computation?

 

Thanks.



Master Collaborator

Yes, if model.decay.zeroThreshold is positive, then anything whose absolute value is smaller is pruned. This can mean entire users are removed, if none of their prefs survive. Do you set this, or decay.factor? By default it's all off and nothing decays, though.
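To illustrate the effect, here is a rough sketch (not the actual Oryx code, and with made-up values):

```python
# Each entry whose absolute value is below zeroThreshold is pruned;
# a user whose every preference is pruned disappears from the model.
ZERO_THRESHOLD = 0.01  # model.decay.zeroThreshold

prefs = [
    ("user-1", "item-a", 1.24),   # survives
    ("user-1", "item-b", 0.005),  # pruned: |0.005| < 0.01
    ("user-2", "item-c", 0.007),  # pruned, and it is user-2's only pref
]

surviving = [(u, i, v) for (u, i, v) in prefs if abs(v) >= ZERO_THRESHOLD]
surviving_users = {u for (u, _, _) in surviving}
print(surviving_users)  # user-2 is gone entirely
```

So if many of your users only have near-zero preference values, whole users can vanish from X.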

Explorer

I see.

Yes, our settings are:

model.decay.factor=1.0
model.decay.zeroThreshold=0.01

That explains it.

 

However, does it only take effect when running with Hadoop? We use the same settings in local computation (single VM), but it seems the threshold is not applied.

 

Thanks.

Master Collaborator

No, it should work the same in both cases. You should see a message like "Pruning near-zero entries". Are you seeing that? That would start to narrow it down.

Explorer

Hmm... I'm not seeing that in the Oryx log.

Is it in the Hadoop logs? Which job step (MergeNewOldStep? RowStep?)?

Master Collaborator

For the stand-alone version? There's no Hadoop there; I mean the Oryx log, yes. My next question, then, is whether you're sure this config is being picked up in your stand-alone mode. You can see where it's applied in "ReadInputs".

Explorer

Here is what I found. It looks odd, but can you double-check it?

 

(1) For the Hadoop version, our settings are:

model.decay.factor=1.0
model.decay.zeroThreshold=0.01

I do NOT see "Pruning near-zero entries" in the Oryx log.

However, from the results, it seems the pruning IS actually performed...

(2) For the stand-alone version (local computation with one VM), same settings:

model.decay.factor=1.0
model.decay.zeroThreshold=0.01

I DO see "Pruning near-zero entries" in the Oryx log.

However, from the results, it seems the pruning is actually NOT performed...

 

(Note)

(a) For both cases I tested, it's generation 0. That is, there is no previous generation.

(b) Our training data looks like the following; note that it is not pre-aggregated by (user-id, item-id):

user-1,item-a,1.24
user-1,item-a,0.002
user-1,item-b,0.005
user-2,item-c,0.007
user-3,item-c,0.006
user-3,item-d,2.5
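Since the data is not pre-aggregated, I wonder whether the threshold is applied to each raw row or to the aggregated (user-id, item-id) value. A rough sketch of the two possibilities (hypothetical, not Oryx code; "user-4" is an extra made-up user whose two sub-threshold rows sum above the threshold):

```python
from collections import defaultdict

ZERO_THRESHOLD = 0.01  # model.decay.zeroThreshold

rows = [
    ("user-1", "item-a", 1.24),
    ("user-1", "item-a", 0.002),
    ("user-1", "item-b", 0.005),
    ("user-2", "item-c", 0.007),
    ("user-3", "item-c", 0.006),
    ("user-3", "item-d", 2.5),
    # hypothetical extra user: two sub-threshold rows summing above threshold
    ("user-4", "item-e", 0.006),
    ("user-4", "item-e", 0.006),
]

def surviving_users(prune_before_aggregation):
    # optionally prune each raw row first
    kept = [r for r in rows
            if not prune_before_aggregation or abs(r[2]) >= ZERO_THRESHOLD]
    # aggregate duplicate (user, item) pairs by summing values
    totals = defaultdict(float)
    for user, item, value in kept:
        totals[(user, item)] += value
    # prune aggregated values below the threshold
    return {u for (u, _), v in totals.items() if abs(v) >= ZERO_THRESHOLD}

print(surviving_users(True))   # prune raw rows, then aggregate
print(surviving_users(False))  # aggregate, then prune
```

The two orderings can keep different user sets, which might matter for the Hadoop vs. stand-alone discrepancy.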
Master Collaborator (accepted solution)

Got it, that's a bug. I fixed it and pushed to master:

https://github.com/cloudera/oryx/issues/115

Explorer

Cool.

I just read your changes, and it seems they only affect the local computation (not the Hadoop computation). Correct?

Yes, I understand the Hadoop computation is already doing the right thing and needs no fix.