Created on 08-23-2015 11:14 AM - edited 09-16-2022 02:38 AM
Sean,
Several questions about Oryx 2:
(1) I know Oryx 2 uses Kafka for the data pipeline. Does Oryx 2 also use Spark Streaming?
(2) Regarding the update and input topics stored in Kafka: if the model is big (say, ~50 GB), it consumes Kafka memory (and disk),
right? Is there a way for the serving layer to get the model from HDFS directly, while the speed layer is still able to approximate predictions
based on real-time events?
(3) Is the model saved in Kafka distributed across the cluster nodes?
Thanks.
Jason
Created 08-23-2015 12:26 PM
Yes, it uses Spark Streaming for both the batch and speed layers.
Really big models are just 'passed' to the topic as an HDFS location. The maximum message size is configurable but is about 16 MB by default. This tends to only matter for decision forests, or ALS models with large numbers of users and items.
The data in Kafka topics is replicated according to the topic configuration. Yes, it can potentially be replicated across the machines that serve as brokers.
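To illustrate the point above: an update-topic consumer can treat a message either as an inline serialized model or, when the model is too big to ship through Kafka, as a reference to a model stored on HDFS. This is a hedged sketch of that dispatch logic only, not Oryx's actual wire format; the function name and path heuristic are made up for illustration.

```python
def resolve_model_message(message: str) -> tuple[str, str]:
    """Illustrative only (not Oryx's real protocol): decide whether an
    update-topic message is an inline model payload or an HDFS reference.

    Returns ('hdfs', path) for a large model stored on HDFS, or
    ('inline', payload) for a model small enough to ship as a Kafka message.
    """
    if message.startswith("hdfs://") or message.startswith("/"):
        # Large model: the message is just a pointer; the serving layer
        # must be able to see HDFS to actually load it.
        return ("hdfs", message)
    # Small model: the payload itself is the serialized model.
    return ("inline", message)

print(resolve_model_message("hdfs://namenode/models/00042/model.pmml"))
print(resolve_model_message("<PMML version=\"4.2\">...</PMML>")[0])
```

This is why the serving layer needs HDFS access for big models: the topic only carries a location, not the bytes.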
Created 08-25-2015 07:08 PM
Quick check...
Does it imply the Oryx 2 serving layer can read the model from HDFS directly (if the model is big)?
Thanks.
Jason
Created 08-26-2015 01:02 AM
Oops, thanks for catching that. Yes, the serving layer needs to see HDFS to read big models. You can change a few Kafka and Oryx configs to allow very large models as Kafka messages, and thus bigger models, if needed, though ideally the serving layer can just see HDFS.
I had also envisioned that the serving layer is often run in or next to the cluster, and isn't publicly visible. It's a service to other front-end systems, or at least sits behind a load balancer. So exposing a machine with cluster access isn't so crazy, as it need not be open to the world.
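For reference, these are the standard Kafka knobs that cap message size; they would need to be raised together to push larger serialized models through the update topic. The key names below are Kafka's, not Oryx's (the thread doesn't quote the Oryx-side setting, so check oryx.conf for the corresponding limit), and the 64 MB value is just an example.

```properties
# Broker: largest message the broker will accept
message.max.bytes=67108864
# Broker: replica fetch size must be at least message.max.bytes
replica.fetch.max.bytes=67108864
# Consumer: largest batch fetched per partition
max.partition.fetch.bytes=67108864
# Producer: largest request the producer will send
max.request.size=67108864
```

All four must agree, or producers and replicas will reject or silently fail to move oversized messages.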
Created 09-03-2015 03:18 PM
Another question about Oryx 2.
The CSV training data contains a Unix timestamp.
(1) What is it for?
(2) Does it matter whether it is in seconds or milliseconds?
Thanks.
Jason
Created 09-03-2015 03:26 PM
The timestamp is for ordering, and for determining decay of the strength factor. The ordering of events is not guaranteed by HDFS / Kafka, and it does matter to some extent, especially if there are 'delete' events. It also matters when figuring out how old a data point is and how much its value has decayed, if decay is enabled.
You could use seconds or milliseconds, I suppose, as long as you use them consistently. However, the serving layer uses a standard millisecond timestamp, so that's probably best to emulate.
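The decay idea above can be sketched as exponential decay over millisecond Unix timestamps. This is a generic illustration, not Oryx's actual formula; the half-life value and function name are hypothetical.

```python
import time

# Hypothetical half-life: a data point loses half its strength every 30
# days. This is a made-up parameter for illustration, not an Oryx default.
HALF_LIFE_MS = 30 * 24 * 60 * 60 * 1000

def decayed_strength(strength: float, event_ts_ms: int, now_ms: int) -> float:
    """Halve the strength for every half-life elapsed since the event.

    Timestamps are millisecond Unix time, matching what the serving
    layer expects.
    """
    age_ms = max(0, now_ms - event_ts_ms)
    return strength * 0.5 ** (age_ms / HALF_LIFE_MS)

now_ms = int(time.time() * 1000)
# An interaction exactly one half-life old retains half its strength.
print(decayed_strength(1.0, now_ms - HALF_LIFE_MS, now_ms))
```

Note that if you fed seconds into a system expecting milliseconds, ages would be off by a factor of 1000 and decay would be wildly wrong, which is why consistency with the serving layer's millisecond convention matters.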