Member since 08-11-2014 | 481 Posts | 92 Kudos Received | 72 Solutions
06-12-2015
02:54 AM
Yes, because the sampling rule is deterministic, the same IDs are sampled each time. I'm fairly stumped by this one, as I can't make out why your user IDs would never get sampled. Clearly it's something to do with the modulus, since different, smaller values work. But it makes little sense unless your IDs' hash values were non-uniform, and they are hashed as strings, which should prevent that. Is it possible to compute the hash code of the string representation of all of your IDs and see how many are 0 mod 3673? At least that would rule some basic things in or out.
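As a sketch of how to run that check offline (the 3673 modulus is the one from this thread; the IDs here are a made-up stand-in for your real user IDs):

```java
public class HashCheck {

    // The same test the sampler applies: hash the ID's string form,
    // then keep it only if the hash is 0 mod the modulus.
    static long countSampled(long[] ids, int modulus) {
        long count = 0;
        for (long id : ids) {
            if (Long.toString(id).hashCode() % modulus == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int modulus = 3673;
        // Hypothetical stand-in for real user IDs: 0..999999
        long[] ids = new long[1_000_000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        long sampled = countSampled(ids, modulus);
        System.out.println(sampled + " of " + ids.length
            + " IDs sampled; ~" + (ids.length / modulus) + " expected if uniform");
    }
}
```

If the count comes out near `ids.length / 3673`, the string hashes are behaving; a count of 0 over a large ID set would point at something unusual about the IDs themselves.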
06-11-2015
11:21 PM
Yes, that's strange, since we should see about 1 in 3673 IDs pass this check. Here's a quick demo of the same idea in the Scala shell:

```scala
val r = new scala.util.Random
(0 to 10000000).par.count(x => r.nextLong.toString.hashCode % 3673 == 0)
// 2854
```

About 1 in 3673 should pass, so we'd expect roughly 2722 of the 10 million to match; we observe 2854, or about 1 in 3503 (10000000 / 2854). The idea ought to be sound. How much input do you have -- how many user IDs? It's a reasonably large number, right? Sampling simply relies on the uniformity of the distribution of the hash code, which is fine. Yes, the problem was that the IDs themselves are sometimes not uniform, but the hashCode should always fix that. Yes, sampling is per iteration and samples the same IDs each time. The sample size is chosen to scale with the input size, but since the input size isn't known, it's proxied by the number of reducers, using an empirically determined formula.
06-10-2015
10:55 PM
That's good that it works at a different value, but I can't figure out why that would be. Obviously it has something to do with the IDs. The two extra log statements below in ConvergenceSampleFn will print all of their hash codes:

```java
@Override
public void process(Pair<Long, float[]> input, Emitter<String> emitter) {
  String userIDString = input.first().toString();
  log.info(Integer.toString(userIDString.hashCode()));
  if (userIDString.hashCode() % convergenceSamplingModulus == 0) {
    float[] xu = input.second();
    for (LongObjectMap.MapEntry<float[]> entry : yState.getY().entrySet()) {
      long itemID = entry.getKey();
      log.info(Integer.toString(Long.toString(itemID).hashCode()));
      if (Long.toString(itemID).hashCode() % convergenceSamplingModulus == 0) {
        float estimate = (float) SimpleVectorMath.dot(xu, entry.getValue());
        emitter.emit(DelimitedDataUtils.encode(',', userIDString, itemID, estimate));
      }
    }
  }
}
```
06-10-2015
10:14 PM
OK, then it was a reasonable fix, but it would not actually have affected you anyway, given that your IDs are strings. I can't see why it wouldn't sample any of the IDs. Their string hashCode ought to be fairly well distributed, so you should get reasonably close to the desired fraction of IDs sampled. You see the "Yconvergence" dir, so the right jobs are running, but there's no output (just _SUCCESS), which suggests that everything is working except that no IDs are being output. I'd like to know what happens to these IDs inside ConvergenceSampleFn, but I know you can't share the IDs. I wonder if it's possible to run just this snippet of code on a bunch of IDs to understand what they hash to? Or to toss in a few logging statements and re-run on your end to see what happens?

```java
@Override
public void process(Pair<Long, float[]> input, Emitter<String> emitter) {
  String userIDString = input.first().toString();
  if (userIDString.hashCode() % convergenceSamplingModulus == 0) {
    float[] xu = input.second();
    for (LongObjectMap.MapEntry<float[]> entry : yState.getY().entrySet()) {
      long itemID = entry.getKey();
      if (Long.toString(itemID).hashCode() % convergenceSamplingModulus == 0) {
        float estimate = (float) SimpleVectorMath.dot(xu, entry.getValue());
        emitter.emit(DelimitedDataUtils.encode(',', userIDString, itemID, estimate));
      }
    }
  }
}
```
06-10-2015
03:26 PM
Hm, do you see a message like "Using convergence sampling modulus ..."? What are your IDs like -- literally, can you show a few examples? That was a good guess, but it may not be the issue.
06-10-2015
08:39 AM
Yes, your IDs. Often they are internally hashed anyway, but if your IDs are already numeric, they are not hashed, and there's no good reason to expect them to be evenly distributed. So the simple deterministic sample here (take 1/n of the data by keeping anything whose value is 0 mod n) doesn't work, because it can fail to sample anything. An extra hash in here should fix that. In one VM there is no need to do this sampling, since all the data is easily available in memory; this mechanism is an efficient equivalent for data-parallel Hadoop-based computation.

Java 7 vs 8 doesn't matter. I was asking because I was about to release 1.1.0 and can add my fix, but it requires Java 7, so I was figuring out whether that would work for you.

Convergence is usually 20-40 iterations at most, but you should not need to set a fixed value. Would you be able to test a new build from source?
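To illustrate why the extra hash helps, here's a minimal sketch (the skewed IDs are made up for the demo; the modulus is the 3673 from this thread). IDs that all happen to be, say, 1 mod n are never picked by the raw rule, while hashing their string form restores a roughly even spread:

```java
import java.util.stream.LongStream;

public class SamplingSketch {

    static final int MODULUS = 3673; // 1-in-3673 sample, as in this thread

    // Raw rule: keep IDs whose value is 0 mod n
    static long rawSampled(long[] ids) {
        return LongStream.of(ids).filter(id -> id % MODULUS == 0).count();
    }

    // Hashed rule: hash the string form first, then take 0 mod n
    static long hashedSampled(long[] ids) {
        return LongStream.of(ids)
                         .filter(id -> Long.toString(id).hashCode() % MODULUS == 0)
                         .count();
    }

    public static void main(String[] args) {
        // Hypothetical skewed IDs: all congruent to 1 mod 3673
        long[] ids = LongStream.range(0, 500_000).map(k -> k * MODULUS + 1).toArray();
        System.out.println("raw rule samples:    " + rawSampled(ids));    // always 0 for these IDs
        System.out.println("hashed rule samples: " + hashedSampled(ids)); // roughly ids.length / 3673
    }
}
```

The raw rule's sample size depends entirely on the ID distribution; the hashed rule's depends only on how uniform String.hashCode is, which is good enough here.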
06-10-2015
08:15 AM
Yes, in general you may wish to use fewer reducers if your data is small and more if it's large, though that's a tuning matter rather than something needed to make it work. This problem has nothing to do with the number of reducers; I was guessing at a corner case, and it's not relevant here. There isn't a special setting to know about, like making it a prime, no. What I do think is happening is that the simple sampling rule isn't quite right, since it depends to some degree on the distribution of your IDs, and there's no good reason to expect an even distribution. Specifically, I suspect none of your IDs are 0 mod 3673. I think there needs to be an extra hash in here. By the way, are you using Java 7? You don't have to; just checking.
06-10-2015
02:24 AM
We are talking about version 1.x here? Yes, while you wouldn't expect identical output from any two runs, and there are some computational differences between local and Hadoop execution, I would not expect such a large difference. You are correct that the problem is that it couldn't pick any data for testing convergence. Is it writing "Yconvergence" temp directories with data? How many reducers do you have? I think the heuristic would fall down if you had a lot of reducers and very little data. Do you see messages like "Sampling for convergence where user/item ID == 0 % ..."?
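For intuition on why many reducers plus very little data is a problem: the exact formula tying the modulus to the reducer count isn't shown in this thread, so assume only that the modulus grows with reducers and take the 3673 value from the logs above. A quick expected-count calculation shows that with a few thousand IDs the expected sample is under one ID, so an empty sample is the likely outcome:

```java
public class ExpectedSample {

    // Expected number of sampled IDs for a 1-in-modulus deterministic sample
    static double expectedSampleSize(long numIds, int modulus) {
        return (double) numIds / modulus;
    }

    public static void main(String[] args) {
        int modulus = 3673; // hypothetical; taken from the log message in this thread
        long[] inputSizes = {1_000, 10_000, 1_000_000};
        for (long n : inputSizes) {
            System.out.printf("%,9d IDs -> expected sample of about %.2f IDs%n",
                              n, expectedSampleSize(n, modulus));
        }
    }
}
```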
06-02-2015
08:34 AM
Yes, the number of splits, and therefore Mapper tasks, is determined by Hadoop MapReduce and is not altered or overridden here. 11 is a default number of Reducer tasks, which you can change. (For various reasons, a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots; that is determined by MapReduce and defaults to 1 per machine, but can be changed if you know the machine can handle many more. This is all just Hadoop machinery, yeah, not specific to this app.
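One reason a prime reducer count helps: Hadoop's default HashPartitioner assigns a key to partition `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, so keys whose hashes share a common factor with the reducer count pile onto a few reducers. Here's a small sketch that mimics that arithmetic rather than calling Hadoop itself (the patterned key hashes are made up for illustration):

```java
import java.util.Arrays;

public class PartitionSkew {

    // Mimics Hadoop's default HashPartitioner:
    // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    static int[] partitionCounts(int[] keyHashes, int numReducers) {
        int[] counts = new int[numReducers];
        for (int h : keyHashes) {
            counts[(h & Integer.MAX_VALUE) % numReducers]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical patterned keys: every hash is a multiple of 10
        int[] hashes = new int[1000];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = i * 10;
        }
        // 10 reducers: every key lands on reducer 0
        System.out.println(Arrays.toString(partitionCounts(hashes, 10)));
        // 11 reducers (prime): keys spread almost perfectly evenly
        System.out.println(Arrays.toString(partitionCounts(hashes, 11)));
    }
}
```

In your own MapReduce driver the reducer count is set with `job.setNumReduceTasks(11)`.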
05-30-2015
04:13 AM
As I say, I don't think more memory helps unless you are memory-bound; it does not increase performance. You should let Hadoop choose the number of mappers in general. It would be more helpful to know something about your data and problem in order to recommend where to look. It sounds like your data is so small that this is all Hadoop overhead, and 'tuning' doesn't help in the sense that it won't reflect how a large data set would behave.