Created on 08-23-2015 11:14 AM - edited 09-16-2022 02:38 AM
Sean,
Several questions about Oryx 2:
(1) I know Oryx 2 uses Kafka for the data pipeline. Does Oryx 2 also use Spark Streaming?
(2) Regarding the update and input topics stored in Kafka: if the model is big (say, ~50 GB), it consumes Kafka memory (and disk),
right? Is there a way for the serving layer to get the model from HDFS directly, while the speed layer is still able to approximate predictions
based on real-time events?
(3) Is the model saved in Kafka distributed across the cluster nodes?
Thanks.
Jason
Created 08-23-2015 12:26 PM
Yes, it uses Spark Streaming for both the batch and speed layers.
Really big models are just 'passed' to the topic as an HDFS location. The maximum message size is configurable but is about 16 MB by default. This tends to only matter for decision forests, or ALS models with large numbers of users and items.
The data in Kafka topics is replicated according to the topic configuration. Yes, it can potentially be replicated across the machines that serve as brokers.
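To illustrate the point above: an update-topic consumer can treat a message either as an inline serialized model or, when the model is too big to ship through Kafka, as a reference to a model stored on HDFS. This is a hedged sketch of that dispatch logic only, not Oryx's actual wire format; the function name and path heuristic are made up for illustration.

```python
def resolve_model_message(message: str) -> tuple[str, str]:
    """Illustrative only (not Oryx's real protocol): decide whether an
    update-topic message is an inline model payload or an HDFS reference.

    Returns ('hdfs', path) for a large model stored on HDFS, or
    ('inline', payload) for a model small enough to ship as a Kafka message.
    """
    if message.startswith("hdfs://") or message.startswith("/"):
        # Large model: the message is just a pointer; the serving layer
        # must be able to see HDFS to actually load it.
        return ("hdfs", message)
    # Small model: the payload itself is the serialized model.
    return ("inline", message)

print(resolve_model_message("hdfs://namenode/models/00042/model.pmml"))
print(resolve_model_message("<PMML version=\"4.2\">...</PMML>")[0])
```

This is why the serving layer needs HDFS access for big models: the topic only carries a location, not the bytes.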
Created 08-25-2015 07:08 PM
Quick check...
Does it imply the Oryx 2 serving layer can read the model from HDFS directly (if the model is big)?
Thanks.
Jason
Created 08-26-2015 01:02 AM
Oops, thanks for catching that. Yes, the serving layer needs to see HDFS to read big models. You can change a few Kafka and Oryx configs to allow very large models as Kafka messages, and thus bigger models, if needed, though ideally the serving layer can just see HDFS.
I had also envisioned that the serving layer is often run in or next to the cluster, and isn't publicly visible. It's a service to other front-end systems, or at least sits behind a load balancer. So exposing a machine with cluster access isn't so crazy, as it need not be open to the world.
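For reference, these are the standard Kafka knobs that cap message size; they would need to be raised together to push larger serialized models through the update topic. The key names below are Kafka's, not Oryx's (the thread doesn't quote the Oryx-side setting, so check oryx.conf for the corresponding limit), and the 64 MB value is just an example.

```properties
# Broker: largest message the broker will accept
message.max.bytes=67108864
# Broker: replica fetch size must be at least message.max.bytes
replica.fetch.max.bytes=67108864
# Consumer: largest batch fetched per partition
max.partition.fetch.bytes=67108864
# Producer: largest request the producer will send
max.request.size=67108864
```

All four must agree, or producers and replicas will reject or silently fail to move oversized messages.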
Created 09-03-2015 03:18 PM
Another question about Oryx 2.
The CSV training data contains a Unix timestamp.
(1) What is it for?
(2) Does it matter whether it is in seconds or milliseconds?
Thanks.
Jason
Created 09-03-2015 03:26 PM
The timestamp is for ordering, and for determining decay of the strength factor. The ordering of events is not guaranteed by HDFS / Kafka, and it does matter to some extent, especially if there are 'delete' events. It also matters when figuring out how old a data point is and how much its value has decayed, if decay is enabled.
You could use seconds or milliseconds, I suppose, as long as you use them consistently. However, the serving layer uses a standard millisecond timestamp, so that's probably best to emulate.
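The decay idea above can be sketched as exponential decay over millisecond Unix timestamps. This is a generic illustration, not Oryx's actual formula; the half-life value and function name are hypothetical.

```python
import time

# Hypothetical half-life: a data point loses half its strength every 30
# days. This is a made-up parameter for illustration, not an Oryx default.
HALF_LIFE_MS = 30 * 24 * 60 * 60 * 1000

def decayed_strength(strength: float, event_ts_ms: int, now_ms: int) -> float:
    """Halve the strength for every half-life elapsed since the event.

    Timestamps are millisecond Unix time, matching what the serving
    layer expects.
    """
    age_ms = max(0, now_ms - event_ts_ms)
    return strength * 0.5 ** (age_ms / HALF_LIFE_MS)

now_ms = int(time.time() * 1000)
# An interaction exactly one half-life old retains half its strength.
print(decayed_strength(1.0, now_ms - HALF_LIFE_MS, now_ms))
```

Note that if you fed seconds into a system expecting milliseconds, ages would be off by a factor of 1000 and decay would be wildly wrong, which is why consistency with the serving layer's millisecond convention matters.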