Member since 08-11-2014 | 481 Posts | 92 Kudos Received | 72 Solutions
06-12-2015
02:54 AM
Yes, because the sampling rule is deterministic, the same IDs are sampled each time. I'm fairly stumped by this one, as I can't make out why your user IDs would never get sampled. Clearly it's something to do with the modulus, since different, smaller values work. But it makes little sense unless your IDs' hash values were non-uniform, and they are hashed as strings, which should prevent that. Is it possible to compute the hash code of the string representation of all of your IDs and see how many are 0 mod 3673? At least that would rule some basic things in or out.
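As a sketch of how to run that check offline (the 3673 modulus is the one from this thread; the IDs here are a made-up stand-in for your real user IDs):

```java
public class HashCheck {

    // The same test the sampler applies: hash the ID's string form,
    // then keep it only if the hash is 0 mod the modulus.
    static long countSampled(long[] ids, int modulus) {
        long count = 0;
        for (long id : ids) {
            if (Long.toString(id).hashCode() % modulus == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int modulus = 3673;
        // Hypothetical stand-in for real user IDs: 0..999999
        long[] ids = new long[1_000_000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        long sampled = countSampled(ids, modulus);
        System.out.println(sampled + " of " + ids.length
            + " IDs sampled; ~" + (ids.length / modulus) + " expected if uniform");
    }
}
```

If the count comes out near `ids.length / 3673`, the string hashes are behaving; a count of 0 over a large ID set would point at something unusual about the IDs themselves.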
06-11-2015
11:21 PM
Yes, that's strange, since we should see about 1 in 3673 IDs pass this check. Here's a quick demo of the same idea in the Scala shell:

```scala
val r = new scala.util.Random
(0 to 10000000).par.count(x => r.nextLong.toString.hashCode % 3673 == 0)
// 2854
```

About 1 in 3673 should pass, so we'd expect roughly 2722 of the 10 million to match; we observe 2854, or about 1 in 3503 (10000000 / 2854). The idea ought to be sound. How much input do you have -- how many user IDs? It's a reasonably large number, right? Sampling simply relies on the uniformity of the distribution of the hash code, which is fine. Yes, the problem was that the IDs themselves are sometimes not uniform, but the hashCode should always fix that. Yes, sampling is per iteration and samples the same IDs each time. The sample size is chosen to scale with the input size, but since the input size isn't known, it's proxied by the number of reducers, using an empirically determined formula.
06-10-2015
10:55 PM
That's good that it works at a different value, but I can't figure out why that would be. Obviously it has something to do with the IDs. The two extra log statements below in ConvergenceSampleFn will print all of their hash codes:

```java
@Override
public void process(Pair<Long, float[]> input, Emitter<String> emitter) {
  String userIDString = input.first().toString();
  log.info(Integer.toString(userIDString.hashCode()));
  if (userIDString.hashCode() % convergenceSamplingModulus == 0) {
    float[] xu = input.second();
    for (LongObjectMap.MapEntry<float[]> entry : yState.getY().entrySet()) {
      long itemID = entry.getKey();
      log.info(Integer.toString(Long.toString(itemID).hashCode()));
      if (Long.toString(itemID).hashCode() % convergenceSamplingModulus == 0) {
        float estimate = (float) SimpleVectorMath.dot(xu, entry.getValue());
        emitter.emit(DelimitedDataUtils.encode(',', userIDString, itemID, estimate));
      }
    }
  }
}
```
06-10-2015
10:14 PM
OK, then it was a reasonable fix, but it would not actually have affected you anyway, given that your IDs are strings. I can't see why it wouldn't sample any of the IDs. Their string hashCode ought to be fairly well distributed, so you should get reasonably close to the desired fraction of IDs sampled. You see the "Yconvergence" dir, so the right jobs are running, but there's no output (just _SUCCESS), which suggests that everything is working except that no IDs are being output. I'd like to know what happens to these IDs inside ConvergenceSampleFn, but I know you can't share the IDs. I wonder if it's possible to run just this snippet of code on a bunch of IDs to understand what they hash to? Or to toss in a few logging statements and re-run on your end to see what happens?

```java
@Override
public void process(Pair<Long, float[]> input, Emitter<String> emitter) {
  String userIDString = input.first().toString();
  if (userIDString.hashCode() % convergenceSamplingModulus == 0) {
    float[] xu = input.second();
    for (LongObjectMap.MapEntry<float[]> entry : yState.getY().entrySet()) {
      long itemID = entry.getKey();
      if (Long.toString(itemID).hashCode() % convergenceSamplingModulus == 0) {
        float estimate = (float) SimpleVectorMath.dot(xu, entry.getValue());
        emitter.emit(DelimitedDataUtils.encode(',', userIDString, itemID, estimate));
      }
    }
  }
}
```
06-10-2015
03:26 PM
Hm, do you see a message like "Using convergence sampling modulus ..."? What are your IDs like -- literally, can you show a few examples? That was a good guess, but it may not be the issue.
06-10-2015
08:39 AM
Yes, your IDs. Often they are internally hashed anyway, but if your IDs are already numeric, they are not hashed, and there's no good reason to expect them to be evenly distributed. So the simple deterministic sample here (take 1/n of the data by keeping anything whose value is 0 mod n) doesn't work, because it can fail to sample anything. An extra hash in here should fix that. In one VM there is no need to do this sampling, since all the data is easily available in memory; this mechanism is an efficient equivalent for data-parallel Hadoop-based computation.

Java 7 vs 8 doesn't matter. I was asking because I was about to release 1.1.0 and can add my fix, but it requires Java 7, so I was figuring out whether that would work for you.

Convergence is usually 20-40 iterations at most, but you should not need to set a fixed value. Would you be able to test a new build from source?
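To illustrate why the extra hash helps, here's a minimal sketch (the skewed IDs are made up for the demo; the modulus is the 3673 from this thread). IDs that all happen to be, say, 1 mod n are never picked by the raw rule, while hashing their string form restores a roughly even spread:

```java
import java.util.stream.LongStream;

public class SamplingSketch {

    static final int MODULUS = 3673; // 1-in-3673 sample, as in this thread

    // Raw rule: keep IDs whose value is 0 mod n
    static long rawSampled(long[] ids) {
        return LongStream.of(ids).filter(id -> id % MODULUS == 0).count();
    }

    // Hashed rule: hash the string form first, then take 0 mod n
    static long hashedSampled(long[] ids) {
        return LongStream.of(ids)
                         .filter(id -> Long.toString(id).hashCode() % MODULUS == 0)
                         .count();
    }

    public static void main(String[] args) {
        // Hypothetical skewed IDs: all congruent to 1 mod 3673
        long[] ids = LongStream.range(0, 500_000).map(k -> k * MODULUS + 1).toArray();
        System.out.println("raw rule samples:    " + rawSampled(ids));    // always 0 for these IDs
        System.out.println("hashed rule samples: " + hashedSampled(ids)); // roughly ids.length / 3673
    }
}
```

The raw rule's sample size depends entirely on the ID distribution; the hashed rule's depends only on how uniform String.hashCode is, which is good enough here.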
06-10-2015
08:15 AM
Yes, in general you may wish to use fewer reducers if your data is small and more if it's large, though that's a tuning matter rather than something needed to make it work. This problem has nothing to do with the number of reducers; I was guessing at a corner case, and it's not relevant here. There isn't a special setting to know about, like making it a prime, no. What I do think is happening is that the simple sampling rule isn't quite right, since it depends to some degree on the distribution of your IDs, and there's no good reason to expect an even distribution. Specifically, I suspect none of your IDs are 0 mod 3673. I think there needs to be an extra hash in here. By the way, are you using Java 7? You don't have to; just checking.
06-10-2015
02:24 AM
We are talking about version 1.x here? Yes, while you wouldn't expect identical output from any two runs, and there are some computational differences between local and Hadoop execution, I would not expect such a large difference. You are correct that the problem is that it couldn't pick any data for testing convergence. Is it writing "Yconvergence" temp directories with data? How many reducers do you have? I think the heuristic would fall down if you had a lot of reducers and very little data. Do you see messages like "Sampling for convergence where user/item ID == 0 % ..."?
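For intuition on why many reducers plus very little data is a problem: the exact formula tying the modulus to the reducer count isn't shown in this thread, so assume only that the modulus grows with reducers and take the 3673 value from the logs above. A quick expected-count calculation shows that with a few thousand IDs the expected sample is under one ID, so an empty sample is the likely outcome:

```java
public class ExpectedSample {

    // Expected number of sampled IDs for a 1-in-modulus deterministic sample
    static double expectedSampleSize(long numIds, int modulus) {
        return (double) numIds / modulus;
    }

    public static void main(String[] args) {
        int modulus = 3673; // hypothetical; taken from the log message in this thread
        long[] inputSizes = {1_000, 10_000, 1_000_000};
        for (long n : inputSizes) {
            System.out.printf("%,9d IDs -> expected sample of about %.2f IDs%n",
                              n, expectedSampleSize(n, modulus));
        }
    }
}
```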
06-02-2015
08:34 AM
Yes, the number of splits, and therefore Mapper tasks, is determined by Hadoop MapReduce and is not altered or overridden here. 11 is a default number of Reducer tasks, which you can change. (For various reasons, a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots; that is determined by MapReduce and defaults to 1 per machine, but can be changed if you know the machine can handle many more. This is all just Hadoop machinery, yeah, not specific to this app.
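One reason a prime reducer count helps: Hadoop's default HashPartitioner assigns a key to partition `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, so keys whose hashes share a common factor with the reducer count pile onto a few reducers. Here's a small sketch that mimics that arithmetic rather than calling Hadoop itself (the patterned key hashes are made up for illustration):

```java
import java.util.Arrays;

public class PartitionSkew {

    // Mimics Hadoop's default HashPartitioner:
    // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    static int[] partitionCounts(int[] keyHashes, int numReducers) {
        int[] counts = new int[numReducers];
        for (int h : keyHashes) {
            counts[(h & Integer.MAX_VALUE) % numReducers]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical patterned keys: every hash is a multiple of 10
        int[] hashes = new int[1000];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = i * 10;
        }
        // 10 reducers: every key lands on reducer 0
        System.out.println(Arrays.toString(partitionCounts(hashes, 10)));
        // 11 reducers (prime): keys spread almost perfectly evenly
        System.out.println(Arrays.toString(partitionCounts(hashes, 11)));
    }
}
```

In your own MapReduce driver the reducer count is set with `job.setNumReduceTasks(11)`.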
05-30-2015
04:13 AM
As I say, I don't think more memory helps unless you are memory-bound; it does not increase performance. You should let Hadoop choose the number of mappers in general. It would be more helpful to know something about your data and problem in order to recommend where to look. It sounds like your data is so small that this is all Hadoop overhead, and 'tuning' doesn't help in the sense that it won't reflect how a large data set would behave.