Overall questions about Oryx 2
Labels: Apache Kafka, Apache Spark, HDFS
Created on 08-23-2015 11:14 AM - edited 09-16-2022 02:38 AM
Sean,
Several questions about Oryx 2:
(1) I know Oryx 2 uses Kafka for its data pipeline. Does Oryx 2 also use Spark Streaming?
(2) Regarding the update and input topics stored in Kafka: if the model is big (say, ~50 GB), it consumes Kafka memory (and disk), right? Is there a way for the serving layer to get the model from HDFS directly, while the speed layer is still able to approximate predictions based on real-time events?
(3) Is the model saved in Kafka distributed across the cluster nodes?
Thanks.
Jason
Created 08-23-2015 12:26 PM
Yes, it uses Spark Streaming for the batch and speed layers.
Really big models are just 'passed' to the topic as an HDFS location rather than as the message payload itself. The maximum message size is configurable but defaults to about 16 MB. This tends to only matter for decision forests, or for ALS models with large numbers of users and items.
The data in Kafka topics is replicated according to the topic config, so yes, it can potentially be replicated across the machines that serve as brokers.
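For reference, here's a sketch of the knobs involved in raising that limit. The Kafka settings below are standard broker/topic configs; the `oryx.update-topic.message.max-size` key and the topic name are my best recollection of the Oryx 2 config, so verify against your reference.conf before relying on them:

```
# Kafka broker config (server.properties): raise the broker-wide message limit
message.max.bytes=67108864            # e.g. 64 MiB
replica.fetch.max.bytes=67108864      # must be >= message.max.bytes so replicas can copy large messages

# Or as a topic-level override (topic name illustrative):
#   kafka-topics.sh --alter --topic oryx-update --config max.message.bytes=67108864

# Oryx side (HOCON) -- key name assumed, check reference.conf
oryx.update-topic.message.max-size = 67108864
```

In practice, as noted above, it's simpler to leave the limit alone and let large models travel as an HDFS location.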
Created 08-25-2015 07:08 PM
Quick check...
Does this imply the Oryx 2 serving layer can read the model from HDFS directly (if the model is big)?
Thanks.
Jason
Created 08-26-2015 01:02 AM
Oops, thanks for catching that. Yes, the serving layer needs to be able to see HDFS to read big models. You can change a few Kafka and Oryx configs to allow very big models as Kafka messages if needed, though ideally the serving layer can just read from HDFS.
I had also envisioned that the serving layer is often run in or next to the cluster, and isn't publicly visible. It's a service for other front-end systems, or at least sits behind a load balancer. So exposing a machine with cluster access isn't so crazy, since it need not be open to the world.
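To make that concrete, here's a minimal sketch of what "seeing HDFS" means in practice: the serving-layer JVM has the Hadoop client and cluster configs on its classpath, and opens the model file by the HDFS location carried on the update topic. The class name and path here are illustrative, not Oryx internals:

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: open a large model file by its HDFS location
// from a process with cluster access.
public final class ModelLoader {
  public static InputStream openModel(String hdfsLocation) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    Path modelPath = new Path(hdfsLocation); // e.g. "hdfs:///user/oryx/model.pmml" (illustrative)
    FileSystem fs = modelPath.getFileSystem(conf);
    return fs.open(modelPath);
  }
}
```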
Created 09-03-2015 03:18 PM
Another question about Oryx 2.
The CSV training data includes a Unix timestamp.
(1) What is it for?
(2) Does it matter whether it's in seconds or milliseconds?
Thanks.
Jason
Created 09-03-2015 03:26 PM
The timestamp is used for ordering, and for determining the decay of the strength factor. The ordering of events is not guaranteed by HDFS / Kafka, and it does matter to some extent, especially if there are 'delete' events. It also matters when figuring out how old a data point is and how much its value has decayed, if decay is enabled.
You could use seconds or milliseconds, I suppose, as long as you use them consistently. However, the serving layer uses a standard millisecond timestamp, so that's probably best to emulate.
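As a rough illustration of why the unit matters, here's a sketch of how a timestamp might feed into decay. The per-day exponential model and the decayFactor value are assumptions for illustration, not Oryx 2's exact internals; the point is doing the seconds-to-milliseconds conversion once, up front:

```java
// Illustrative only: decaying an interaction's strength by its age.
public final class DecayExample {
  public static void main(String[] args) {
    long eventSeconds = 1441306800L;          // seconds-resolution Unix timestamp from a CSV row
    long eventMillis = eventSeconds * 1000L;  // convert once; use milliseconds everywhere after
    long nowMillis = System.currentTimeMillis();

    double ageDays = (nowMillis - eventMillis) / 86_400_000.0; // ms per day
    double decayFactor = 0.99;                // hypothetical per-day decay factor
    double strength = 1.0;                    // original interaction strength
    double decayed = strength * Math.pow(decayFactor, ageDays);

    System.out.printf("age = %.1f days, decayed strength = %.4f%n", ageDays, decayed);
  }
}
```

Had eventSeconds been mistaken for milliseconds, the event would look decades old and its strength would decay to essentially zero, which is why one consistent unit (ideally milliseconds) matters.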
