Member since 08-11-2014 · 481 Posts · 92 Kudos Received · 72 Solutions
06-28-2015
11:33 AM
It doesn't do any communication of its own; this is all traffic to/from the Hadoop cluster for HDFS and YARN. Hadoop has no idea about the Oryx process, so it should be dead simple in this regard. Those don't look like well-known ports, so maybe this is it trying to talk to the YARN app that runs the MapReduce? What is failing at that point? I would expect the serving layer to be more predictable, since it only needs to talk to HDFS, and those daemons should be on well-known ports. In any event it's "just" standard Hadoop mechanisms here, which may mean you can ask support for assistance with constraining the ports that are used. But in general the computation layer needs to be close to the cluster and is intended to be inside its firewall.
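For what it's worth, here's a minimal sketch (assuming a standard Hadoop 2.x client with the cluster config on its classpath; `fs.defaultFS` and `yarn.resourcemanager.address` are stock Hadoop properties, nothing Oryx-specific) of how to see which well-known daemon endpoints a client like the computation layer will resolve:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Prints the well-known daemon addresses a Hadoop client (like the Oryx
// computation layer) resolves from the cluster configuration on its classpath.
public class PrintClusterEndpoints {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // HDFS NameNode endpoint, e.g. hdfs://namenode:8020
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    // YARN ResourceManager RPC endpoint, e.g. resourcemanager:8032
    System.out.println("yarn.resourcemanager.address = "
        + conf.get(YarnConfiguration.RM_ADDRESS));
  }
}
```

Anything beyond those fixed daemon endpoints (like the MapReduce application master's port) is allocated dynamically by YARN, which is why it can look unpredictable from outside the firewall.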
06-25-2015
11:36 PM
These are instructions for installing via packages, which is not the usual way to do it. Do you really intend that? If so, have you set up the Cloudera repos? Generally you manage CDH with parcels, and yes, updating Spark means updating CDH, since you're really talking about updating many other harmonized dependencies along with it.
06-22-2015
01:37 AM
Sure guys, let me know if it seems to work. Once this is resolved I am going to cut a 1.1.0 release.
06-21-2015
03:40 AM
Tell me more about your setup -- what's your config, and how much data are you sending? Do you see any log messages about "Mean average precision:"?
06-21-2015
03:08 AM
I think this is an issue with the installation of the native Snappy libs in your environment. The native Snappy code isn't finding the right libstdc++ on your system. You'll either need to address that, or remove Snappy.
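As a quick diagnostic, here's a sketch (assuming Hadoop 2.x is on the classpath) that asks Hadoop directly whether its native code and Snappy support loaded; an `UnsatisfiedLinkError` here, often mentioning libstdc++, points at the same native-library problem as in your job logs:

```java
import org.apache.hadoop.util.NativeCodeLoader;

// Checks whether the native Hadoop library, and its Snappy support, loaded.
public class CheckSnappyNative {
  public static void main(String[] args) {
    System.out.println("native hadoop loaded: "
        + NativeCodeLoader.isNativeCodeLoaded());
    try {
      System.out.println("build supports snappy: "
          + NativeCodeLoader.buildSupportsSnappy());
    } catch (UnsatisfiedLinkError e) {
      System.out.println("snappy native libs not loadable: " + e);
    }
  }
}
```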
06-14-2015
12:45 PM
1 Kudo
This concerns version 1.x by the way. The config elements in question are here: https://github.com/cloudera/oryx/blob/master/common/src/main/resources/reference.conf#L136
06-13-2015
01:04 AM
I have a new branch with a better approach: https://github.com/cloudera/oryx/tree/Issue112 Are you able to build and try this branch? I can send you a binary too.
06-12-2015
12:56 PM
1 Kudo
Ah OK, I think I understand this now. I made two small mistakes here. The first was overlooking that you actually have a small number of items -- about a thousand, right? That matches the number of records going into the sampling function. And on re-reading the code, I see that the job is invoked over _items_, so the loop is really over items and then users, despite the names in the code. That is why there is so little input to this job: they're item IDs.

So, choosing the sampling rate based on the number of reducers is a little problematic, but reasonable. However, the number of reducers you have may be suitable for the number of users, but not the number of items, which may be very different. That's a deeper suboptimality, since in your case the user and item jobs have very different input sizes. Normally it just means the item jobs in the iterations have more reducers than necessary, which is only a little extra overhead. But here it has also manifested as an actual problem for the way this convergence heuristic works.

One option is to let the user override the sampling rate, but that seems like something the user shouldn't have to set. Another option is to expose control over the number of reducers for the user- and item-related jobs separately. That might be a good idea for the reasons above, although it's a slightly unrelated issue. More directly, I'm going to look at ways to efficiently count the number of users and items and choose a sampling rate accordingly. If the rate is too low, nothing is sampled; if it's too high, the sampling takes a long time. I had hoped to avoid another job just to do this counting, but maybe there is an efficient way to figure it out. Let me do some homework.
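To make the idea concrete, here's a simplified sketch of count-based sampling -- not the actual Oryx code, and `TARGET_SAMPLE_SIZE` is a made-up knob -- showing how a rate chosen from a real count, plus hash-based sampling, keeps the same IDs in the sample on every iteration:

```java
import java.util.Arrays;
import java.util.List;

// Chooses a sampling rate from an actual count of IDs, then samples
// deterministically by hash so the *same* IDs are sampled on every iteration,
// which is what lets consecutive samples be compared for convergence.
public class ConvergenceSampling {

  private static final int TARGET_SAMPLE_SIZE = 100; // hypothetical knob

  static double chooseSamplingRate(long count) {
    // Too low a rate samples nothing; too high samples nearly everything.
    // Aim for roughly TARGET_SAMPLE_SIZE sampled IDs.
    return Math.min(1.0, (double) TARGET_SAMPLE_SIZE / count);
  }

  static boolean isSampled(String id, double rate) {
    // Hash-based, so a given ID is either always or never in the sample.
    double u = (id.hashCode() & 0x7FFFFFFF) / (double) Integer.MAX_VALUE;
    return u < rate;
  }

  public static void main(String[] args) {
    List<String> itemIDs = Arrays.asList("i1", "i2", "i3"); // ~1,000 in practice
    double rate = chooseSamplingRate(itemIDs.size());
    for (String id : itemIDs) {
      if (isSampled(id, rate)) {
        System.out.println("sampled: " + id);
      }
    }
  }
}
```

The failure mode you hit corresponds to a rate derived from the wrong count: with ~1,000 item IDs and a rate sized for millions of users, essentially nothing passes the hash test, so the sample is empty.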
06-12-2015
11:24 AM
Does this have ConvergenceSampleFn in the name? That's the bit of interest. If that's what you're looking at, then it indicates that only 1,190 users are in the input. Yes, we already know there are 0 output records, and yes, that is the problem. So the question now is: why is that happening? Stepping back, how much input is really going into the first MapReduce jobs? Is it actually consistent with the data set size you expect, 7.5M users? That's orders of magnitude different. You could browse the MR jobs to walk back and find where the size of the data diverges from what's normal; that might help narrow down what's happening.
06-12-2015
09:01 AM
Sampling happens on every iteration. It has to record the current estimates for the same sampled users/items, and those change on each iteration. On the second iteration it's possible to compare the current vs. previous sample estimates to assess convergence. Yes, the sampling function is the same for both users and items; it's all in that function above. The next thing I'd check is the statistics from the MapReduce job that runs ConvergenceSampleFn: how many records went into the reducer, and how many came out? I assume 0 were emitted, but I'm wondering if somehow it's running on just a small subset of the data. That would at least explain it, but I don't know if that's the case. You should see about 7.5M records going into the reducer, I believe.
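If it's easier programmatically, here's a sketch of reading those reducer statistics (assuming you can get a handle on the finished `Job`; the same numbers appear on the job's counters page in the web UI):

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// Reads the reduce-side record counters for a finished MapReduce job.
public class PrintReduceCounters {
  static void printReduceCounters(Job job) throws Exception {
    Counters counters = job.getCounters();
    long in = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
    long out = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
    System.out.println("records into reducer:   " + in);  // expect ~7.5M here
    System.out.println("records out of reducer: " + out); // 0 indicates the problem
  }
}
```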