Support Questions


Loss of users after training

Explorer

Sean,

 

As I posted in other discussion threads, we are trying to run Oryx 1.0 with CDH 5.4.1.

One thing I noticed: when computing with Hadoop, we have 7.5 million users in the training set. However, after training, it generates an X matrix with only about 7.3 million users.

 

I checked the log and found no messages or errors related to this. I also tried the same training dataset in local computation (single VM), and it produces 7.5 million users in X just fine. I looked at the users that got lost during the Hadoop computation and noticed that all their preference values are less than 0.01... It definitely makes sense to ignore associations with very low preference values, but I cannot find such a "config" or "control"... Is there such a thing in Oryx? Why does it filter in the Hadoop computation but not in the single-VM computation?

 

Thanks.



Master Collaborator

Yes, if model.decay.zeroThreshold is positive, then anything whose absolute value is smaller is pruned. This can mean entire users are removed, if none of their prefs survive. Do you set this, or decay.factor? By default it's all off and nothing decays, though.
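To illustrate the effect, here is a rough sketch (not the actual Oryx code, and with made-up values):

```python
# Each entry whose absolute value is below zeroThreshold is pruned;
# a user whose every preference is pruned disappears from the model.
ZERO_THRESHOLD = 0.01  # model.decay.zeroThreshold

prefs = [
    ("user-1", "item-a", 1.24),   # survives
    ("user-1", "item-b", 0.005),  # pruned: |0.005| < 0.01
    ("user-2", "item-c", 0.007),  # pruned, and it is user-2's only pref
]

surviving = [(u, i, v) for (u, i, v) in prefs if abs(v) >= ZERO_THRESHOLD]
surviving_users = {u for (u, _, _) in surviving}
print(surviving_users)  # user-2 is gone entirely
```

So if many of your users only have near-zero preference values, whole users can vanish from X.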

Explorer

I see.

Yes, our settings are:

model.decay.factor=1.0
model.decay.zeroThreshold=0.01

That explains it.

 

However, does it only take effect when running with Hadoop? We use the same settings in local computation (single VM), but it seems the threshold is not applied.

 

Thanks.

Master Collaborator

No, it should work the same in both cases. You should see a message like "Pruning near-zero entries". Are you seeing that? That would start to narrow it down.

Explorer

Hmm... I'm not seeing that in the Oryx log.

Is it in the Hadoop logs? Which job step (MergeNewOldStep? RowStep?)?

Master Collaborator

For the stand-alone version? There's no Hadoop there; I mean the Oryx log, yes. My next question, then, is whether you're sure this config is being picked up in your stand-alone mode. You can see where it's applied in "ReadInputs".

Explorer

Here is what I found. It looks odd, but can you double-check it?

 

(1) For the Hadoop version, our settings are:

model.decay.factor=1.0
model.decay.zeroThreshold=0.01

I do NOT see "Pruning near-zero entries" in the Oryx log.

However, from the results, it seems the pruning IS actually performed...

(2) For the stand-alone version (local computation with one VM), same settings:

model.decay.factor=1.0
model.decay.zeroThreshold=0.01

I DO see "Pruning near-zero entries" in the Oryx log.

However, from the results, it seems the pruning is actually NOT performed...

 

(Note)

(a) For both cases I tested, it's generation 0. That is, there is no previous generation.

(b) Our training data looks like the following; note that it is not pre-aggregated by (user-id, item-id):

user-1,item-a,1.24
user-1,item-a,0.002
user-1,item-b,0.005
user-2,item-c,0.007
user-3,item-c,0.006
user-3,item-d,2.5
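Since the data is not pre-aggregated, I wonder whether the threshold is applied to each raw row or to the aggregated (user-id, item-id) value. A rough sketch of the two possibilities (hypothetical, not Oryx code; "user-4" is an extra made-up user whose two sub-threshold rows sum above the threshold):

```python
from collections import defaultdict

ZERO_THRESHOLD = 0.01  # model.decay.zeroThreshold

rows = [
    ("user-1", "item-a", 1.24),
    ("user-1", "item-a", 0.002),
    ("user-1", "item-b", 0.005),
    ("user-2", "item-c", 0.007),
    ("user-3", "item-c", 0.006),
    ("user-3", "item-d", 2.5),
    # hypothetical extra user: two sub-threshold rows summing above threshold
    ("user-4", "item-e", 0.006),
    ("user-4", "item-e", 0.006),
]

def surviving_users(prune_before_aggregation):
    # optionally prune each raw row first
    kept = [r for r in rows
            if not prune_before_aggregation or abs(r[2]) >= ZERO_THRESHOLD]
    # aggregate duplicate (user, item) pairs by summing values
    totals = defaultdict(float)
    for user, item, value in kept:
        totals[(user, item)] += value
    # prune aggregated values below the threshold
    return {u for (u, _), v in totals.items() if abs(v) >= ZERO_THRESHOLD}

print(surviving_users(True))   # prune raw rows, then aggregate
print(surviving_users(False))  # aggregate, then prune
```

The two orderings can keep different user sets, which might matter for the Hadoop vs. stand-alone discrepancy.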
Master Collaborator (accepted solution)

Got it, that's a bug. I fixed it and pushed to master:

https://github.com/cloudera/oryx/issues/115

Explorer

Cool.

I just read your changes, and it seems they only affect the local computation (not the Hadoop computation). Correct?

Yes, I understand the Hadoop computation is already doing the right thing and needs no fix.