About srowen

srowen · ‎09-02-2015

That generally means the application is still waiting for YARN to allocate an executor, which in turn usually means YARN doesn't have enough free resources to satisfy the request. Compare the number and size of the executors you requested against the available resources, and against the maximum size of any single container that your YARN config allows.
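As a rough illustration of that check, here is a minimal sketch that compares an executor request against YARN limits. It assumes the per-executor memory overhead is 10% of executor memory with a 384MB minimum (verify this against the docs for your Spark version), and it ignores the application master's container and YARN's rounding up to its minimum allocation. All function names and numbers are hypothetical; substitute the values from your own job and YARN config.

```python
def container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Approximate memory YARN must grant for one executor container.

    The overhead rule here is an assumption; check your Spark version.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


def check_request(num_executors, executor_memory_mb, max_container_mb, free_cluster_mb):
    """Return a list of reasons the request may never be satisfied (empty if none)."""
    per_container = container_mb(executor_memory_mb)
    problems = []
    # A single container may not exceed YARN's maximum allocation.
    if per_container > max_container_mb:
        problems.append(
            "each executor needs ~%d MB but the largest allowed container is %d MB"
            % (per_container, max_container_mb)
        )
    # All executors together must fit in what the cluster has free.
    total = per_container * num_executors
    if total > free_cluster_mb:
        problems.append(
            "%d executors need ~%d MB in total but only %d MB is free"
            % (num_executors, total, free_cluster_mb)
        )
    return problems


if __name__ == "__main__":
    # Hypothetical values: 10 executors of 4GB, 8GB max container, 40000 MB free.
    for problem in check_request(10, 4096, 8192, 40000):
        print(problem)
```

If the list comes back non-empty, either shrink the request or raise the relevant YARN limit.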

srowen · ‎08-26-2015

This may be too unspecific to be helpful, but I recall several JIRAs fixed for Spark 1.4 that concern the .inprogress files and history server. I expect that whatever this is could be related. If so, then the fix would be coming in 5.5 at the latest.

srowen · ‎08-26-2015

Oops, thanks for catching that. Yes, the serving layer needs to see HDFS to read big models. You can change a few Kafka and Oryx configs to allow very large Kafka messages, and thus bigger models, if needed, though ideally the serving layer can just see HDFS. I had also envisioned that the serving layer is often run in or next to the cluster, and isn't publicly visible. It's a service to other front-end systems, or at least sits behind a load balancer. So exposing a machine with cluster access isn't so crazy, as it need not be open to the world.

srowen · ‎08-25-2015

Oh, this section configures the serving layer REST API -- what port it runs on, SSL cert, password, path, etc.

srowen · ‎08-25-2015

What do you mean by part API?

srowen · ‎08-24-2015

Oryx uses Spark Streaming, and Spark runs its executors on YARN. So YARN manages the resources used by the batch and speed layer. You can also use YARN to run the serving layer binaries via the oryx-run.sh script.

srowen · ‎08-23-2015

There shouldn't be any other dependencies. If the error is like what you showed before, it's just firewall/port config problems.

srowen · ‎08-23-2015

Yes, it uses Spark Streaming for the batch and speed layers. Really big models are just 'passed' to the topic as an HDFS location rather than sent inline. The maximum message size is configurable, but is about 16MB. This tends to only matter for decision forests, or for ALS models with large numbers of users and items. The data in Kafka topics is replicated according to the topic config. Yes, it can potentially be replicated across the machines that serve as brokers.
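The inline-versus-reference decision can be sketched like this. This is not Oryx's actual code: the function name, the message labels, and the 16MB constant are illustrative assumptions that mirror the description above.

```python
# Assumed limit of about 16MB; the real value is whatever your config sets.
MAX_MESSAGE_BYTES = 16 * 1024 * 1024


def build_update(model_bytes, hdfs_path, max_bytes=MAX_MESSAGE_BYTES):
    """Return a (key, payload) pair to publish to the update topic.

    Small models travel inline; large ones are replaced by their HDFS
    location, which the consumer must then be able to read itself.
    The key names are hypothetical labels for this sketch.
    """
    if len(model_bytes) <= max_bytes:
        return ("MODEL", model_bytes)
    return ("MODEL-REF", hdfs_path.encode("utf-8"))


if __name__ == "__main__":
    key, payload = build_update(b"<PMML/>", "/hypothetical/path/model.pmml")
    print(key, len(payload))
```

The second branch is why a consumer of the topic needs HDFS access whenever models exceed the limit.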

srowen · ‎08-23-2015

Right, I forgot to mention that part: you need the cluster's binaries too, like ZK, HDFS, YARN, Spark, etc. It is using the cluster's distribution. As you can see, it's definitely intended to be run on a cluster edge node, so I'd strongly suggest running it that way.

srowen · ‎08-21-2015

You can run the binaries on any machine that can see the Hadoop configuration on the classpath, and which can access all of the services it needs to in the cluster. There are a number of services to talk to: HDFS, YARN, Kafka, Spark and the app's executors. So in general you'd have to have a lot of ports open, and at that point your machine is effectively a gateway node in the cluster. Certainly it's meant to be run within the cluster. The serving layer only needs access to Kafka, and that's by design, so it might more easily run outside the cluster.
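To see whether a candidate machine can actually reach those services, a quick TCP check is a reasonable first step. The sketch below uses only the Python standard library; the host names and ports in the usage section are placeholders, not defaults to rely on, so fill in the addresses from your own cluster config. A successful connect only shows the port is open, not that the service is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each service name to whether its (host, port) is reachable."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


if __name__ == "__main__":
    # Hypothetical endpoints; replace with values from your cluster.
    endpoints = {
        "hdfs-namenode": ("namenode.example.com", 8020),
        "yarn-resourcemanager": ("rm.example.com", 8032),
        "kafka-broker": ("kafka.example.com", 9092),
        "zookeeper": ("zk.example.com", 2181),
    }
    for name, ok in check_endpoints(endpoints).items():
        print(name, "reachable" if ok else "NOT reachable")
```

Note that Spark executors also connect back to the driver on ports chosen at runtime, so a clean result here is necessary but not sufficient.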

Member Since	‎08-11-2014 09:17 AM
Last Visited	‎02-13-2018 12:34 PM
Posts	481
Kudos received	87

Cloudera Community

Re: Own code editor in CDSW?

Re: error using Pandas within PySpark transformati...

Re: Does CDSW need to be part of the cluster?

Re: Local Data combined with HDFS

Re: Where can I find Oryx 1.x releases (or GitHub)

Re: Endless INFO Client: Application report for ap...

Re: Spark Application history not found Applicatio...

Re: Overall questions about Oryx 2

Re: Oryx and Yarn

Re: Oryx and Yarn

Re: Oryx and Yarn

Re: Run Oryx on a machine that is not part of the ...

Re: Overall questions about Oryx 2

Re: Run Oryx on a machine that is not part of the ...

Re: Run Oryx on a machine that is not part of the ...