Member since 08-11-2014 · 481 Posts · 92 Kudos Received · 72 Solutions
06-28-2015
11:33 AM
It doesn't do any communication of its own; this is all traffic to/from the Hadoop cluster for HDFS and YARN. Hadoop has no idea about the Oryx process, so it should be dead simple in this regard. Those don't look like well-known ports, so maybe this is it trying to talk to the YARN app that runs the MapReduce? What is failing at that point? I would expect the serving layer to be more predictable, since it only needs to talk to HDFS, and those daemons should be on well-known ports. In any event it's "just" standard Hadoop mechanisms here, which may mean you can ask support for assistance with constraining the ports that are used. But in general the computation layer needs to be close to the cluster and is intended to be inside its firewall.
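For what it's worth, here's a minimal sketch (assuming a standard Hadoop 2.x client with the cluster config on its classpath; `fs.defaultFS` and `yarn.resourcemanager.address` are stock Hadoop properties, nothing Oryx-specific) of how to see which well-known daemon endpoints a client like the computation layer will resolve:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Prints the well-known daemon addresses a Hadoop client (like the Oryx
// computation layer) resolves from the cluster configuration on its classpath.
public class PrintClusterEndpoints {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // HDFS NameNode endpoint, e.g. hdfs://namenode:8020
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    // YARN ResourceManager RPC endpoint, e.g. resourcemanager:8032
    System.out.println("yarn.resourcemanager.address = "
        + conf.get(YarnConfiguration.RM_ADDRESS));
  }
}
```

Anything beyond those fixed daemon endpoints (like the MapReduce application master's port) is allocated dynamically by YARN, which is why it can look unpredictable from outside the firewall.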
06-25-2015
11:36 PM
These are instructions for installing via packages, which is not the usual way to do it. Do you really intend that? If so, have you set up the Cloudera repos? Generally you manage CDH with parcels, and yes, updating Spark means updating CDH, since you're really talking about updating many other harmonized dependencies along with it.
06-22-2015
01:37 AM
Sure guys, let me know if it seems to work. Once this is resolved I am going to cut a 1.1.0 release.
06-21-2015
03:40 AM
Tell me more about your setup -- what's your config, and how much data are you sending? Do you see any log messages about "Mean average precision:"?
06-21-2015
03:08 AM
I think this is an issue with the installation of the native Snappy libs in your environment. The native Snappy code isn't finding the right libstdc++ on your system. You'll either need to address that, or remove Snappy.
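As a quick diagnostic, here's a sketch (assuming Hadoop 2.x is on the classpath) that asks Hadoop directly whether its native code and Snappy support loaded; an `UnsatisfiedLinkError` here, often mentioning libstdc++, points at the same native-library problem as in your job logs:

```java
import org.apache.hadoop.util.NativeCodeLoader;

// Checks whether the native Hadoop library, and its Snappy support, loaded.
public class CheckSnappyNative {
  public static void main(String[] args) {
    System.out.println("native hadoop loaded: "
        + NativeCodeLoader.isNativeCodeLoaded());
    try {
      System.out.println("build supports snappy: "
          + NativeCodeLoader.buildSupportsSnappy());
    } catch (UnsatisfiedLinkError e) {
      System.out.println("snappy native libs not loadable: " + e);
    }
  }
}
```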
06-14-2015
12:45 PM
1 Kudo
This concerns version 1.x by the way. The config elements in question are here: https://github.com/cloudera/oryx/blob/master/common/src/main/resources/reference.conf#L136
06-13-2015
01:04 AM
I have a new branch with a better approach: https://github.com/cloudera/oryx/tree/Issue112 Are you able to build and try this branch? I can send you a binary too.
06-12-2015
12:56 PM
1 Kudo
Ah OK, I think I understand this now. I made two small mistakes here. The first was overlooking that you actually have a small number of items -- about a thousand, right? That matches the number of records going into the sampling function. And on re-reading the code, I see that the job is invoked over _items_, so the loop is really over items and then users, despite the names in the code. That is why there is so little input to this job: they're item IDs.

So, choosing the sampling rate based on the number of reducers is a little problematic, but reasonable. However, the number of reducers you have may be suitable for the number of users, but not the number of items, which may be very different. That's a deeper suboptimality, since in your case the user and item jobs have very different input sizes. Normally it just means the item jobs in the iterations have more reducers than necessary, which is only a little extra overhead. But here it has also manifested as an actual problem for the way this convergence heuristic works.

One option is to let the user override the sampling rate, but that seems like something the user shouldn't have to set. Another option is to expose control over the number of reducers for the user- and item-related jobs separately. That might be a good idea for the reasons above, although it's a slightly unrelated issue. More directly, I'm going to look at ways to efficiently count the number of users and items and choose a sampling rate accordingly. If the rate is too low, nothing is sampled; if it's too high, the sampling takes a long time. I had hoped to avoid another job just to do this counting, but maybe there is an efficient way to figure it out. Let me do some homework.
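To make the idea concrete, here's a simplified sketch of count-based sampling -- not the actual Oryx code, and `TARGET_SAMPLE_SIZE` is a made-up knob -- showing how a rate chosen from a real count, plus hash-based sampling, keeps the same IDs in the sample on every iteration:

```java
import java.util.Arrays;
import java.util.List;

// Chooses a sampling rate from an actual count of IDs, then samples
// deterministically by hash so the *same* IDs are sampled on every iteration,
// which is what lets consecutive samples be compared for convergence.
public class ConvergenceSampling {

  private static final int TARGET_SAMPLE_SIZE = 100; // hypothetical knob

  static double chooseSamplingRate(long count) {
    // Too low a rate samples nothing; too high samples nearly everything.
    // Aim for roughly TARGET_SAMPLE_SIZE sampled IDs.
    return Math.min(1.0, (double) TARGET_SAMPLE_SIZE / count);
  }

  static boolean isSampled(String id, double rate) {
    // Hash-based, so a given ID is either always or never in the sample.
    double u = (id.hashCode() & 0x7FFFFFFF) / (double) Integer.MAX_VALUE;
    return u < rate;
  }

  public static void main(String[] args) {
    List<String> itemIDs = Arrays.asList("i1", "i2", "i3"); // ~1,000 in practice
    double rate = chooseSamplingRate(itemIDs.size());
    for (String id : itemIDs) {
      if (isSampled(id, rate)) {
        System.out.println("sampled: " + id);
      }
    }
  }
}
```

The failure mode you hit corresponds to a rate derived from the wrong count: with ~1,000 item IDs and a rate sized for millions of users, essentially nothing passes the hash test, so the sample is empty.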
06-12-2015
11:24 AM
Does this have ConvergenceSampleFn in the name? That's the bit of interest. If that's what you're looking at, then it indicates that only 1,190 users are in the input. Yes, we already know there are 0 output records, and yes, that is the problem. So the question now is: why is that happening? Stepping back, how much input is really going into the first MapReduce jobs? Is it actually consistent with the data set size you expect, 7.5M users? That's orders of magnitude different. You could browse the MR jobs to walk back and find where the size of the data diverges from what's normal; that might help narrow down what's happening.
06-12-2015
09:01 AM
Sampling happens on every iteration. It has to record the current estimates for the same sampled users/items, and those change on each iteration. On the second iteration it's possible to compare the current vs. previous sample estimates to assess convergence. Yes, the sampling function is the same for both users and items; it's all in that function above. The next thing I'd check is the statistics from the MapReduce job that runs ConvergenceSampleFn: how many records went into the reducer, and how many came out? I assume 0 were emitted, but I'm wondering if somehow it's running on just a small subset of the data. That would at least explain it, but I don't know if that's the case. You should see about 7.5M records going into the reducer, I believe.
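If it's easier programmatically, here's a sketch of reading those reducer statistics (assuming you can get a handle on the finished `Job`; the same numbers appear on the job's counters page in the web UI):

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Reads the reduce-side record counters for a finished MapReduce job.
public class PrintReduceCounters {
  static void printReduceCounters(Job job) throws Exception {
    Counters counters = job.getCounters();
    long in = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
    long out = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
    System.out.println("records into reducer:   " + in);  // expect ~7.5M here
    System.out.println("records out of reducer: " + out); // 0 indicates the problem
  }
}
```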