Support Questions


How the Speed Layer Internalizes Updates

New Contributor

Hello Mr. Owen,

 

Great work with Oryx2!

 

I am trying to understand the algorithm and flow of Oryx2.

 

From the SpeedLayer class, I understand that we have two Kafka threads:

1) One that runs the consume method

2) One that calls SpeedLayerUpdate, which in turn calls buildUpdates from ALSSpeedModelManager.
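The two loops above can be sketched roughly like this (a toy sketch with hypothetical names, not the actual Oryx2 classes; in-memory queues stand in for the Kafka topics):

```java
import java.util.concurrent.*;

// Skeleton of the two consumer loops: one reads new input and publishes
// a (toy) update for it; the other reads the update topic and applies
// updates to the in-memory model.
public class TwoThreadSketch {
    static BlockingQueue<String> inputTopic = new LinkedBlockingQueue<>();
    static BlockingQueue<String> updateTopic = new LinkedBlockingQueue<>();
    static ConcurrentMap<String, String> model = new ConcurrentHashMap<>();

    // loop 1: read new input and publish an update for it
    static Runnable inputConsumer = () -> {
        String msg;
        while ((msg = inputTopic.poll()) != null) {
            updateTopic.add("UP," + msg);          // stands in for buildUpdates output
        }
    };

    // loop 2: read the update topic and apply updates to the model
    static Runnable updateConsumer = () -> {
        String up;
        while ((up = updateTopic.poll()) != null) {
            String[] p = up.split(",");            // "UP,user,value"
            model.put(p[1], p[2]);
        }
    };

    public static void main(String[] args) throws InterruptedException {
        inputTopic.add("alice,5");
        Thread t1 = new Thread(inputConsumer);
        Thread t2 = new Thread(updateConsumer);
        t1.start(); t1.join();                     // run sequentially for determinism
        t2.start(); t2.join();
        System.out.println(model.get("alice"));    // prints 5
    }
}
```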

 

When I try to trace this, I find that buildUpdates generates updates based on the X feature vectors. Now if a new rating is given, buildUpdates generates features and this message is pushed to a Kafka topic. What I don't understand is how these new features get internalized by the speed model.

 

If the message is only read by the serving layer, then the speed layer won't have these new features for its future reference.

 

So is the update from SpeedLayerUpdate also listened to by consume, simultaneously with the serving layer? Or am I missing where buildUpdates internalizes its updates into ALSSpeedModel?

 

The flow diagram of Oryx2 also suggests that the model updates are not read by the speed layer.

 

 

Please excuse my ignorance.


5 REPLIES

Master Collaborator

The short answer is that it does not internalize updates itself. It's an interesting question of design. Of course, an updated model matters when answering queries in the serving layer. When just being used to determine how a new input changes the model, it's not necessarily important to have consumed prior updates to compute a good-enough update.

 

From an implementation perspective, it makes things significantly simpler; the model updates, in general, are intended to be computed in a distributed way with Spark. If they were also updating the model, it'd be hard and slow to also coordinate those in-memory updates meaningfully. The price of course is that the speed layer itself isn't learning in real time, even if that isn't actually nearly as important as the serving layer doing so.
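The "good-enough update" point can be illustrated with a toy model (nothing here is Oryx2 code; the single-feature model and learning rate are made up): an update computed from a slightly stale copy of a feature still moves it in the same direction as one computed from fresh features.

```java
// Toy single-feature "ALS-like" nudge: move the user feature toward
// reproducing the observed rating against the item feature.
public class StaleUpdateSketch {
    static double buildUpdate(double userFeature, double itemFeature, double rating) {
        double error = rating - userFeature * itemFeature;  // prediction error
        return userFeature + 0.1 * error * itemFeature;     // small gradient step
    }

    public static void main(String[] args) {
        double fresh = buildUpdate(0.5, 1.0, 2.0);  // from up-to-date features
        double stale = buildUpdate(0.4, 1.0, 2.0);  // from a stale copy
        // Both updates move the feature toward the rating of 2.0, so the
        // stale-model update is still usable by the serving layer.
        System.out.println(fresh > 0.5 && stale > 0.4);  // prints true
    }
}
```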

 

Now, interestingly, for ALS, the model itself is so big that the updates themselves can't be computed in a distributed way. It's actually done serially on the driver, on one big copy in memory. So it would be realistic to apply model updates as it goes. I'm going to file this as a "to do" to think about further, since it's also the model where it matters more than others.

 

It also occurs to me that, for this reason, the driver should multi-thread its computation of the updates for ALS. Also a to-do.
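Since each vector's update is independent of the others, the driver-side multi-threading suggested above could look roughly like this (a generic sketch with a made-up per-vector "update" function, not Oryx2 code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: fan per-user update computations out over a fixed thread pool
// on one machine, then collect the results.
public class ParallelUpdateSketch {
    static Map<String, Double> computeUpdates(Map<String, Double> features) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, Future<Double>> futures = new HashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            // stand-in for the real per-vector update computation
            futures.put(e.getKey(), pool.submit(() -> e.getValue() * 2.0));
        }
        Map<String, Double> updates = new HashMap<>();
        futures.forEach((user, f) -> {
            try {
                updates.put(user, f.get());     // block until each result is ready
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        });
        pool.shutdown();
        return updates;
    }

    public static void main(String[] args) {
        Map<String, Double> in = new HashMap<>();
        in.put("alice", 1.0);
        in.put("bob", 2.0);
        System.out.println(computeUpdates(in));  // {alice=2.0, bob=4.0} (order may vary)
    }
}
```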

New Contributor

I agree that it would be computationally slow for the speed layer to consume the output of buildUpdates itself. Then what is the actual need for setUserVector and setItemVector in the speed layer, or at what point in time do they get invoked?

 

If I understand correctly: suppose we pass three new ratings in 2 seconds. Then the serving layer will hold only the features updated from the last rating, since the updates from buildUpdates will all be built on the old features, and the earlier updates will be overwritten. And since the speed layer model doesn't have these updated features, it cannot learn from ongoing user behavior. Only when the batch layer re-runs and the batch model gets updated can we learn about user behavior?

 

Am I correct, or am I missing something?

 

And if so, how can we learn about a new user at the earliest? Or can we add these updates to another update builder for the serving layer?

Master Collaborator

 

Oh! On re-reading this, I realize it already consumes its own updates, actually. It took a moment of reading this to recall the architecture. I should really add a note in the source code.

 

The setUserVector/setItemVector you see is actually where it consumes updates from the batch layer. The batch layer generally does not produce updates of course, but this is again a special case. The ALS model is so large that it has to be shipped around as a huge set of updates. This is tidy. But, this also means it is listening to its own updates and processing them in exactly the same way. So -- at a short delay -- it is hearing its own updates and applying them.

 

Even if this were not so, the speed layer would still be producing updates in response to new input immediately. The question is merely what model is used to compute the update.
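That self-consumption loop can be sketched like so (toy code with hypothetical names, not the real ALSSpeedModel; a plain queue stands in for the Kafka update topic):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// The speed layer publishes its updates, and its own update consumer
// applies them back into the in-memory model (as setUserVector does),
// so the next update builds on the previous one, at a short delay.
public class SelfConsumeSketch {
    static Queue<String> updateTopic = new ArrayDeque<>();
    static Map<String, Double> model = new HashMap<>();

    static void onInput(String user, double delta) {
        double updated = model.getOrDefault(user, 0.0) + delta; // toy "buildUpdates"
        updateTopic.add(user + "," + updated);                  // publish the update
    }

    static void drainUpdates() {                                // the update consumer
        String msg;
        while ((msg = updateTopic.poll()) != null) {
            String[] p = msg.split(",");
            model.put(p[0], Double.parseDouble(p[1]));          // like setUserVector
        }
    }

    public static void main(String[] args) {
        onInput("alice", 1.0);
        drainUpdates();        // the layer hears its own update...
        onInput("alice", 2.0); // ...so this update sees 1.0 rather than 0.0
        drainUpdates();
        System.out.println(model.get("alice")); // prints 3.0
    }
}
```

If the first drainUpdates() is skipped, the second update is computed against the stale 0.0 and overwrites the first, which is exactly the overwrite scenario asked about earlier in the thread.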

New Contributor

Now that's what I thought!

 

I can understand that it builds updates anyway, but the impression that it didn't learn from new entries left me whirling and digging through the code!

 

So in this case, could you update the API docs and flow chart? There may be others who get confused like me.

 

Thanks in advance, and thanks for clearing up my doubt!

 

Also, how can I contribute to its development?

Master Collaborator

It's tricky because in general the ALS implementation we are talking about is a special case compared to normal models, but it's a big special case. I think the general architecture is correct at the level it's presented. I don't want to complicate it too much.

Your feedback is a valuable contribution. Problems and bug fixes are important, but so are ways the architecture could be improved or opened up.