
Oryx ALS running with Hadoop

Explorer

Sean,

 

We are running Oryx with Hadoop.

It converges around iteration 13.

However, the same dataset with the same training parameters takes about 120-130 iterations to converge in a single local VM (that is, not running with Hadoop).

 

This does not seem to make sense. I would think the number of iterations does not depend on the platform (Hadoop vs. a single local VM).

The iteration count should depend only on the training parameters, the convergence threshold, and the initial value of Y.

In other words, I would expect to see a similar iteration count from Hadoop and from the single local VM.

 

When running in Hadoop, I noticed the log message below. It looks like convergence is reached in so few iterations because there are no samples, and an "artificial convergence" value is used instead. I did not see a similar message in the single local VM; there it shows something like "Avg absolute difference in estimate vs prior iteration over 18163 samples: 0.20480296387324523". So I think this may be the issue.

Any suggestions or thoughts on why this happens?

 

Tue Jun 09 22:14:38 PDT 2015 INFO No samples for convergence; using artificial convergence value: 6.103515625E-5
Tue Jun 09 22:14:38 PDT 2015 INFO Converged
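
To check my understanding of the heuristic, here is a rough sketch of what I imagine such a sampled convergence check looks like, pieced together from the two log messages (my own illustration, not Oryx's actual code; the class, method, and constant names are made up): average the absolute change in the sampled estimates between iterations, and fall back to a tiny "artificial" value when the sample is empty, which then trivially passes any convergence threshold.

import java.util.Map;

// Hypothetical sketch, not Oryx's code: a sampled convergence measure with an
// artificial fallback when nothing was sampled.
final class ConvergenceSketch {

  // Made-up fallback constant; the real value in the log above is similarly tiny.
  private static final double ARTIFICIAL_CONVERGENCE = 1.0 / (1 << 14);

  // current/previous map a sampled (user,item) pair key to its estimated value.
  static double convergenceValue(Map<String, Double> current, Map<String, Double> previous) {
    double sumAbsDiff = 0.0;
    long count = 0;
    for (Map.Entry<String, Double> e : current.entrySet()) {
      Double prior = previous.get(e.getKey());
      if (prior != null) {
        sumAbsDiff += Math.abs(e.getValue() - prior);
        count++;
      }
    }
    // An empty sample carries no real signal, so a tiny artificial value is returned,
    // and that looks "converged" against any reasonable threshold.
    return count == 0 ? ARTIFICIAL_CONVERGENCE : sumAbsDiff / count;
  }
}

Is that roughly what happens?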

 

Thanks.

 

Jason

1 ACCEPTED SOLUTION

Master Collaborator

Ah OK, I think I understand this now. I made two little mistakes here. The first was overlooking that you actually have a small number of items -- about a thousand, right? -- which matches the number of records going into the sampling function. And on re-reading the code, I see that the job is invoked over _items_, so really the loop is over items and then users, despite the names in the code. That is why there is so little input to this job -- they're item IDs.

 

So, choosing the sampling rate based on the number of reducers is a little problematic, but reasonable. However, the number of reducers you have may be suitable for the number of users but not for the number of items, which can be very different. This is a bit of a deeper suboptimality, since in your case the user and item jobs handle very different amounts of data. Normally it just means the item jobs in each iteration have more reducers than necessary, which is just a little extra overhead. But here it has also manifested as an actual problem for the way this convergence heuristic works.
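
To make that concrete, here is a back-of-the-envelope sketch using the numbers from this thread. It rests on two assumptions of mine rather than on the exact Oryx code: that the modulus grows with the reducer count, and that both the user and the item ID of a pair must be divisible by it (which is consistent with the "7.412388E-6%" figure in your log, since 1/3673² expressed as a percentage is about 7.41E-6 %).

// Back-of-the-envelope illustration (assumptions above, not the exact Oryx logic):
// with modulus 3673 and only ~1000 item IDs feeding the sampling job, an
// "ID == 0 mod modulus" rule is very unlikely to select anything at all.
final class SamplingRateSketch {
  public static void main(String[] args) {
    int modulus = 3673;        // from the "convergence sampling modulus" log line
    long numItems = 1_000L;    // roughly the item count in this thread

    // Fraction of user-item pairs sampled if both IDs must be 0 mod the modulus:
    double pairFraction = 1.0 / ((double) modulus * modulus);
    System.out.printf("pairs sampled: ~%e%%%n", pairFraction * 100.0);   // ~7.41E-6 %

    // Expected number of item IDs equal to 0 mod the modulus, assuming uniform IDs:
    double expectedQualifyingItems = (double) numItems / modulus;        // ~0.27
    System.out.printf("expected qualifying items: %.2f%n", expectedQualifyingItems);
    // With well under one qualifying item expected, the sample is very likely empty,
    // and the job falls back to the artificial convergence value.
  }
}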

 

One option is to let the user override the sampling rate, but it seems like something the user shouldn't have to set.

Another option is to expose control over the number of reducers for the user- and item-related jobs separately. That might be a good idea for reasons above, although it's a slightly unrelated issue.

 

More directly, I'm going to look at ways to efficiently count the number of users and items and choose a sampling rate accordingly. If it's too low, nothing is sampled; if it's too high, the sampling takes a long time. I had hoped to avoid another job to just do this counting, but maybe there is an efficient way to figure it out.
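
Roughly, the idea is something like the following sketch (just an illustration of the approach, with made-up names; not a commitment to what the eventual implementation will do): count users and items, then pick the largest modulus whose expected sample still contains some target number of pairs.

// Illustration only, not the eventual implementation: choose the sampling modulus
// from the actual user and item counts so that the expected number of sampled
// user-item pairs hits a target.
final class ChooseModulusSketch {

  static int chooseModulus(long numUsers, long numItems, long targetSampledPairs) {
    // If both IDs must be 0 mod m, the expected sample size is roughly
    // (numUsers / m) * (numItems / m); solve that for m.
    double m = Math.sqrt((double) numUsers * numItems / targetSampledPairs);
    return Math.max(1, (int) m);
  }

  public static void main(String[] args) {
    // Roughly the sizes from this thread: ~7 million users, ~1000 items.
    // Prints about 836, versus the 3673 that was actually used here.
    System.out.println(chooseModulus(7_000_000L, 1_000L, 10_000L));
  }
}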

 

Let me do some homework.


27 REPLIES

Master Collaborator

Are we talking about version 1.x here?

 

Yes: while you wouldn't expect identical output from any two runs, and there are some computational differences between local and Hadoop execution, I would not expect such a large difference.

 

You are correct that the problem is that it couldn't pick any data for testing convergence. Is it writing "Yconvergence" temp directories with data? How many reducers do you have? I think the heuristic would fall down if you had a lot of reducers and very little data.

 

Do you see messages like "Sampling for convergence where user/item ID == 0 % ..."?

Explorer

Thanks for your reply.

(1) Yes, Oryx 1.x (more precisely, Oryx 1.0.1)

(2) I checked the "Yconvergence" temp directories. For example, when the job "...0-8-Y-RowStep..." is running, I see "...00000/tmp/iterations/7/Yconvergence" with only one file, "_SUCCESS", inside. And there is no "...00000/tmp/iterations/8/Yconvergence".


(3) I use 30 reducers, and the test data is about 3.5 GB (~7.x million users, ~one thousand items, ~51 million events).
It's interesting that you said "I think the heuristic would fall down if you had a lot of reducers and very little data".
Do you mean that when the data is small, I should reduce the number of reducers? Is it because too many reducers partition the "small" data into smaller groups per reducer, and that affects convergence? Can you explain the details, so that I can share and discuss them with my co-workers?

(4) How can I avoid this convergence issue? Just decrease the number of reducers? Any suggestion on a "reasonable" setting based on the data size?

The training data will grow, and we want to know how to adjust the reducer count dynamically based on the data size, so that we get good performance when running big data on a big cluster and still avoid the convergence issue. In general, on a big cluster we want to allocate more reducers to use the power of the cluster.

 

(5) Related to (4): in our case, the numbers of users and events will grow significantly, BUT not the items, which will probably stay around 1200-1500. I am thinking of using more reducers in our bigger cluster to handle the bigger data set. Given that our item count stays small (even though the user and event counts become big), will we still hit the same convergence issue (because the item set stays small)? This is my main concern.

 

(6) I have used 10, 20, and 30 when adjusting the number of reducers. Should I use a prime number instead? Would that help with the convergence issue?


(7) I did not see "Sampling for convergence where user/item ID == 0 % ...", but I saw the following log message for almost every iteration. The numbers (3673 and 7.412388E-6%) in the message are the same in every iteration, which seems odd:
"Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence"

Thanks.

Jason

Master Collaborator

Yes, in general you may wish to use fewer reducers if your data is small and more if it's large, though that's more of a tuning issue than something necessary to make it work. In general this problem has nothing to do with the number of reducers; I was guessing at a corner case, and it's not relevant here. There isn't a special setting to know about, like making it a prime, no.

What I do think is happening is that the simple sampling rule isn't quite right, since it depends to some degree on the distribution of your IDs, and there's no good reason to expect an even distribution. Specifically, I suspect none of your IDs are 0 mod 3673. I think there needs to be an extra hash in here.

By the way, are you using Java 7? You don't have to be, just checking.
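
To illustrate that concern with made-up IDs (since I don't know what yours look like): a deterministic "ID == 0 mod m" sample selects nothing at all if the IDs happen to avoid that residue, and nothing forces real IDs to spread evenly across residues.

// Made-up illustration of the concern above; not real IDs and not Oryx code.
final class ModulusSampleSketch {
  public static void main(String[] args) {
    int modulus = 3673;                  // the modulus from your logs
    long hits = 0;
    for (long k = 0; k < 1_000_000L; k++) {
      long id = modulus * k + 1;         // skewed IDs: never divisible by 3673
      if (id % modulus == 0) {
        hits++;
      }
    }
    System.out.println(hits);            // prints 0: nothing would ever be sampled
  }
}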

Explorer

Sean

(1) I tried both Java 7 and Java 8. It behaves the same way with respect to the convergence issue.

(2) Can you explain a little bit about "...none of your IDs are 0 mod 3673..."? Which IDs do you mean: user IDs, item IDs, or both?

(3) Why is there no such problem when running in a single VM? Is the convergence sampling rule different from the ALS version on Hadoop?

 

Thanks again.

Jason

Master Collaborator

Yes, your IDs. Often they are internally hashed anyway, but if your IDs are already numeric, they are not hashed, and there's no good reason to expect them to be evenly distributed. So the simple deterministic sample here doesn't work (sample 1/n of the data by taking anything whose value is 0 mod n), because it fails to sample anything. An extra hash in here should fix that.

In one VM there is no need to do this sampling, since all the data is easily available in memory. This mechanism is an efficient equivalent for data-parallel Hadoop-based computation.

Java 7 vs 8 doesn't matter. I was asking because I was about to release 1.1.0 and can add my fix, but it requires Java 7, so I was figuring out whether that would work for you.

Convergence is usually 20-40 iterations at most, but you should not need to set a fixed value. Would you be able to test a new build from source?
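
As a sketch of the extra-hash idea (my illustration here, not the exact change that will land in 1.1.0): hash the ID before taking the modulus, so that roughly 1 in n IDs is sampled no matter how the raw IDs are distributed.

// Illustration of hashing before the modulus check, reusing the skewed IDs from
// the earlier sketch; not the actual Oryx patch.
final class HashedSampleSketch {

  // Any reasonably uniform hash will do for a sketch; String.hashCode() is enough here.
  static boolean sampled(String id, int modulus) {
    return id.hashCode() % modulus == 0;
  }

  public static void main(String[] args) {
    int modulus = 3673;
    long plainHits = 0;
    long hashedHits = 0;
    for (long k = 0; k < 1_000_000L; k++) {
      long id = modulus * k + 1;                 // IDs that are never 0 mod 3673
      if (id % modulus == 0) {
        plainHits++;
      }
      if (sampled(Long.toString(id), modulus)) {
        hashedHits++;
      }
    }
    System.out.println("without hash: " + plainHits);   // 0
    System.out.println("with hash:    " + hashedHits);  // roughly 1,000,000 / 3673, i.e. a few hundred
  }
}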

Explorer

Sean,

 

Got it. Thanks.

Good to know that you plan to fix this in the Oryx 1.1.0 release.

Do you have an idea of the timeline?

I was able to build from your source (1.0.1) using Java 8, and I do not think there would be any issue building 1.1.0 from source.

 

 

Jason

Explorer

Hi Sean,

 

I noticed that you have a new commit (https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb).

It looks like it addresses the no-sample hashing issue.

 

Just to let you know, I gave it a try (I downloaded the code and compiled it with Java).

It still seems to have the same issue:

"....No samples for convergence; using artificial convergence value: 0.001953125....".

 

I use 30 reducers, and I do notice (from the Oryx code base) that the modulus is related to the number of reducers.

 

Jason

Master Collaborator

Hm, do you see a message like "Using convergence sampling modulus ... "?

What are your IDs like? Literally, can you show a few examples?

That was a good guess, but it may not be the issue.

Explorer

Sean,

 

Yes, I saw this message for each iteration... something like:

Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence

 

I cannot share the exact IDs, but I can share the format:

User ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX (each X is a letter or a digit)

Item ID: XXXXX_xxxxxxxx (each X is an uppercase letter or a digit; each x is a lowercase letter or a digit).

 

Thanks.

Jason