Oryx ALS running with Hadoop

Jason.Chen — Fri, 16 Sep 2022 09:31:14 GMT

Sean,

We are running Oryx with Hadoop.

It converges at around iteration 13.

However, the same dataset with the same training parameters takes about 120-130 iterations to converge on a single local VM (that is, not running with Hadoop).

This does not seem to make sense. I would think the number of iterations does not depend on the platform (Hadoop or computation on one local VM). It should depend on the training parameters, the threshold, and the initial value of Y. In other words, I expect to see a similar number of iterations from Hadoop and from a single local VM.

When running in Hadoop, I noticed the following log message. It looks like convergence is reached at a low iteration because there are no samples and it uses an "artificial convergence" value. I did not see a similar message on the single local VM (it shows something like "Avg absolute difference in estimate vs prior iteration over 18163 samples: 0.20480296387324523"). So, I think this may be the issue.

Any suggestions or thoughts on why this happens?

Tue Jun 09 22:14:38 PDT 2015 INFO No samples for convergence; using artificial convergence value: 6.103515625E-5
Tue Jun 09 22:14:38 PDT 2015 INFO Converged

Thanks.

Jason

Re: Oryx ALS running with Hadoop

srowen — Wed, 10 Jun 2015 09:24:38 GMT

We are talking about version 1.x here?

Yes, while you wouldn't expect identical output from any two runs, and there are some computational differences between local and Hadoop, I would not expect such a large difference.

You are correct that the problem is that it couldn't pick any data for testing convergence. Is it writing "Yconvergence" temp directories with data? How many reducers do you have? I think the heuristic would fall down if you had a lot of reducers and very little data.

Do you see messages like "Sampling for convergence where user/item ID == 0 % ..."?

Re: Oryx ALS running with Hadoop

Jason.Chen — Wed, 10 Jun 2015 15:05:29 GMT

Thanks for your reply.

(1) Yes, Oryx 1.x (more precisely, Oryx 1.0.1)

(2) I checked the "Yconvergence" temp directories. For example, when job "...0-8-Y-RowStep..." is running, I see "...00000/tmp/iterations/7/Yconvergence" with only one file, "_SUCCESS", inside. There is no "...00000/tmp/iterations/8/Yconvergence".

(3) I use 30 reducers, and the test data is about 3.5 GB (~7.x million users, ~one thousand items, ~51 million events).
Hmm, it's interesting that you said "...I think the heuristic would fall down if you had a lot of reducers and very little data"...
Do you mean that when the data is small, I should reduce the number of reducers? Is it because too many reducers partition the "small" data into even smaller groups for each reducer, and that affects convergence? Can you explain in more detail, so that I can share and discuss it with my co-workers?

(4) How can I avoid this convergence issue? Just decrease the number of reducers? Any suggestion for a "reasonable" setting based on the data size?

The training data will grow, and we want to know how to adjust the number of reducers dynamically based on the data size, so that we get good performance when running big data on a big cluster and still avoid the convergence issue.

In general, in a big cluster, we want to allocate more reducers, so it uses the power of the cluster.

(5) Related to (4),

In our case, the number of users and events will grow significantly, BUT the number of items will not; it will probably stay at about 1200-1500 items.

I am thinking of using more reducers on our bigger cluster to handle the bigger data set. Given that our item count stays small (although the user and event counts become big), will we still have the same convergence issue because of the small number of items? This is my main concern.

(6) I use 10, 20, or 30 when I adjust the number of reducers. Should I use a prime number instead? Would that help with the convergence issue?

(7) I did not see "Sampling for convergence where user/item ID == 0 % ...", but I saw the following log message in almost every iteration.
The numbers (3673 and 7.412388E-6%) in the log message are the same in each iteration, which seems odd.
"Log: Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence"
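
The percentage in that log line is consistent with keeping a user when its ID is 0 modulo the modulus and, independently, keeping an item by the same rule, so that roughly 1 in modulus squared user-item pairs is sampled. A minimal sketch of that arithmetic follows. The formula is an assumption that happens to reproduce the logged figure, and the class and method names are illustrative, not taken from Oryx.

```java
final class SamplingFraction {

  // Percentage of user-item pairs kept when users and items are each
  // sampled at a rate of 1/modulus. Assumed formula, not from the Oryx source.
  static double pairSamplePercent(int modulus) {
    double perSide = 1.0 / modulus;
    return 100.0 * perSide * perSide;
  }

  public static void main(String[] args) {
    // 3673 is the modulus reported in the log message above.
    System.out.println(pairSamplePercent(3673));
  }
}
```

With modulus 3673 this gives roughly 7.41E-6 percent, matching the logged 7.412388E-6%. If the figure is derived this way, it depends only on the modulus, which would explain why it is identical in every iteration.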

Thanks.

Jason

Re: Oryx ALS running with Hadoop

srowen — Wed, 10 Jun 2015 15:15:36 GMT

Yes, in general you may wish to use fewer reducers if your data is small and more if it's large, though that's more a matter of tuning than something needed to make it work. In general this problem has nothing to do with the number of reducers; I was guessing at a corner case and it's not relevant here. There isn't a special setting to know about, like making it a prime, no. What I do think is happening is that the simple sampling rule isn't quite right, since it will depend to some degree on the distribution of your IDs, and there's no good reason to expect an even distribution. Specifically, I suspect none of your IDs are 0 mod 3673. I think there needs to be an extra hash in here. By the way, are you using Java 7? You don't have to, just checking.

Re: Oryx ALS running with Hadoop

Jason.Chen — Thu, 11 Jun 2015 08:19:33 GMT

Sean

(1) I tried both Java 7 and Java 8. The convergence issue behaves the same way on both.

(2) Can you explain a little bit what you mean by "...none of your IDs are 0 mod 3673.."? Which IDs? User IDs, item IDs, or both?

(3) Why is there no such problem when running on a single VM? Is the convergence sampling rule different from the one in the Hadoop version of ALS?

Thanks again.

Jason

Re: Oryx ALS running with Hadoop

srowen — Wed, 10 Jun 2015 15:39:25 GMT

Yes, your IDs. Often they are hashed internally anyway, but if your IDs are already numeric, they are not hashed, and there's no good reason to expect them to be evenly distributed. So the simple deterministic sample here (sample 1/n of the data by taking anything whose value is 0 mod n) doesn't work, because it fails to sample anything. An extra hashing step in here should fix that.

In one VM there is no need to do this sampling, since all the data is easily available in memory. This mechanism is an efficient equivalent for the data-parallel Hadoop-based computation.

Java 7 vs 8 doesn't matter. I was asking because I am about to release 1.1.0 and can add my fix, but it requires Java 7, so I was figuring out whether that would work for you.

Convergence usually takes 20-40 iterations at most, but you should not need to set a fixed value. Would you be able to test a new build from source?
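
To illustrate the failure mode described above, here is a minimal sketch comparing the plain "0 mod n" rule with the same rule applied to a hash of the ID. The ID set is hypothetical and deliberately skewed so that every ID is 1 mod 3673. Hashing the decimal string form is only one possible choice of extra hash, and the names are illustrative rather than taken from Oryx.

```java
final class ModuloSamplingDemo {

  // Deterministic 1/n sample on the raw numeric ID.
  static boolean sampledRaw(long id, int modulus) {
    return id % modulus == 0;
  }

  // Same rule applied to the hash of the ID's decimal string form.
  // A negative hashCode yields a negative or zero remainder in Java,
  // so the "== 0" test still behaves as intended.
  static boolean sampledHashed(long id, int modulus) {
    return Long.toString(id).hashCode() % modulus == 0;
  }

  public static void main(String[] args) {
    int modulus = 3673;
    int raw = 0;
    int hashed = 0;
    // Hypothetical skewed IDs: every one of them is 1 mod 3673.
    for (long k = 0; k < 100_000; k++) {
      long id = k * modulus + 1;
      if (sampledRaw(id, modulus)) raw++;
      if (sampledHashed(id, modulus)) hashed++;
    }
    System.out.println("raw=" + raw + " hashed=" + hashed);
  }
}
```

On this input the raw rule can never select anything, because no ID is a multiple of the modulus. The hashed count is not tied to that arithmetic pattern, so printing both counts shows whether the extra hash restores a usable sample.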

Re: Oryx ALS running with Hadoop

Jason.Chen — Wed, 10 Jun 2015 15:48:25 GMT

Sean,

Got it. Thanks.

Good to know that you plan to fix this in the Oryx 1.1.0 release.

Do you have an idea of the timeline?

I was able to build from your source (1.0.1) using Java 8, and I do not think there would be an issue building 1.1.0 from source.

Jason

Re: Oryx ALS running with Hadoop

Jason.Chen — Wed, 10 Jun 2015 22:16:37 GMT

Hi Sean,

I noticed that you have a new commit (https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb).

It looks like it addresses the issue of sampling without a hash.

Just to let you know, I gave it a try (I downloaded the code and compiled it with Java).

It still seems to have the same issue:

"....No samples for convergence; using artificial convergence value: 0.001953125....".

I use 30 reducers, and I did notice (from the Oryx code base) that the modulus is related to the number of reducers.

Jason

Re: Oryx ALS running with Hadoop

srowen — Wed, 10 Jun 2015 22:26:13 GMT

Hm, do you see a message like "Using convergence sampling modulus ... "?

What are your IDs like? Literally, can you show a few examples?

That was a good guess but it may not be the issue.

Re: Oryx ALS running with Hadoop

Jason.Chen — Thu, 11 Jun 2015 00:14:51 GMT

Sean,

Yes, I saw this message for each iteration... something like:

Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence

I cannot share the exact IDs, but I can share the format:

User ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX (X is either a letter or a digit)

Item ID: XXXXX_xxxxxxxx (X is either an upper-case letter or a digit; x is either a lower-case letter or a digit).

Thanks.

Jason

Re: Oryx ALS running with Hadoop

srowen — Thu, 11 Jun 2015 05:14:50 GMT

OK, then it was a reasonable fix but it actually would not have affected you anyway given that your IDs are strings.

I can't see why it wouldn't sample any of the IDs. Their string hashCode ought to be fairly well distributed, so you should get reasonably close to the desired fraction of IDs sampled. You see the "Yconvergence" dir, so the right jobs are running, but there's no output (just _SUCCESS), which suggests that everything is working except not outputting IDs.

I'd like to know what happens to these IDs inside ConvergenceSampleFn, but I know you can't share the IDs. I wonder if it's possible to run just that snippet of code on a bunch of IDs to understand what they hash to? Or to toss in a few logging statements and re-run on your end to see what happens?

@Override
public void process(Pair<Long, float[]> input, Emitter<String> emitter) {
  String userIDString = input.first().toString();
  if (userIDString.hashCode() % convergenceSamplingModulus == 0) {
    float[] xu = input.second();
    for (LongObjectMap.MapEntry<float[]> entry : yState.getY().entrySet()) {
      long itemID = entry.getKey();
      if (Long.toString(itemID).hashCode() % convergenceSamplingModulus == 0) {
        float estimate = (float) SimpleVectorMath.dot(xu, entry.getValue());
        emitter.emit(DelimitedDataUtils.encode(',', userIDString, itemID, estimate));
      }
    }
  }
}

Re: Oryx ALS running with Hadoop

Jason.Chen — Thu, 11 Jun 2015 05:30:05 GMT

Sean,

Thanks for the follow up.

Yes, I can try that. Can you insert the appropriate log.info calls into the code you want me to try, so that it logs the right information for you to review?

Meanwhile, I tried reducing the number of reducers (from 30 to 10), and I noticed it did take samples to calculate the convergence distance. I checked the code, and it looks like the number of reducers is used to generate the modulus.

For example:

Avg absolute difference in estimate vs prior iteration over 2124 samples: 0.02002799961913492

Jason

Re: Oryx ALS running with Hadoop

srowen — Thu, 11 Jun 2015 05:55:24 GMT

That's good that it works at a different value, but I can't figure out why that would be. Obviously it has something to do with the IDs. The two extra log statements in ConvergenceSampleFn will print all of their hash codes:

@Override
public void process(Pair<Long, float[]> input, Emitter<String> emitter) {
  String userIDString = input.first().toString();
  log.info(Integer.toString(userIDString.hashCode()));
  if (userIDString.hashCode() % convergenceSamplingModulus == 0) {
    float[] xu = input.second();
    for (LongObjectMap.MapEntry<float[]> entry : yState.getY().entrySet()) {
      long itemID = entry.getKey();
      log.info(Integer.toString(Long.toString(itemID).hashCode()));
      if (Long.toString(itemID).hashCode() % convergenceSamplingModulus == 0) {
        float estimate = (float) SimpleVectorMath.dot(xu, entry.getValue());
        emitter.emit(DelimitedDataUtils.encode(',', userIDString, itemID, estimate));
      }
    }
  }
}

Re: Oryx ALS running with Hadoop

Jason.Chen — Fri, 12 Jun 2015 06:10:03 GMT

Sean,

(1) Here are some results:

INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 2111185186541130611 hashCode= 977794330

INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174317317673160368 hashCode= 463078209
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174428972624599832 hashCode= 1617905253
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3444764202548713566 hashCode= 1628781813
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3653094543606133455 hashCode= 1773010709

(2) In the Hadoop log, I do not see any output from the following statement. Based on this, it seems that no ID passes the "if (userIDString.hashCode() % convergenceSamplingModulus == 0)" check.

log.info(Integer.toString(Long.toString(itemID).hashCode()));

(3) Can you explain overall how the sampling works?

(a) Does it sample in each reducer of each iteration?

(b) When it samples, does it loop over all Long user IDs and Long item IDs and then apply the mod? I saw that you use hashCode in the new code. Oryx 1.0.1 uses the Long IDs directly for the mod.

Why did you choose the modulus this way?

Thanks.

Re: Oryx ALS running with Hadoop

srowen — Fri, 12 Jun 2015 06:21:54 GMT

Yes, that's strange, since we should see about 1 in 3673 IDs pass this check. Here's a quick demo of the same idea from some Scala one-liners:

val r = new scala.util.Random

(0 to 10000000).par.count(x => r.nextLong.toString.hashCode % 3673 == 0)

2854

10000000/2854

3503

That is about 1 in 3503 IDs sampled (2854 out of 10 million), reasonably close to the expected 1 in 3673. The idea ought to be sound. How much input do you have? How many user IDs? It's a reasonably large number, right?

Sampling is simply relying on uniformity of the distribution of the hash code, which is fine.

Yes, the problem was that IDs are not uniform sometimes, but the hashCode should always fix that.

Yes, sampling is per iteration and samples the same IDs each time.

The sampling size is chosen to try to scale up with the input size but it doesn't know the input size, so it's proxied by the number of reducers. This is an empirically determined formula.

Re: Oryx ALS running with Hadoop

Jason.Chen — Fri, 12 Jun 2015 07:01:53 GMT

Hm... it's strange that no IDs passed.

We have 7.6 million user IDs.

A question on this: "...Yes, sampling is per iteration and samples the same IDs each time....".

As an example, say there are 30 reducers and we are in iteration 3:

(1) In iteration 3, reducer #1

It loops over all the user IDs (and item IDs) inside reducer #1.

(2) In iteration 3, reducer #2

It loops over all the user IDs (and item IDs) inside reducer #2.

Then, after iteration 3, it saves the sampled IDs. Are the same sample IDs then used in iteration 4, where it compares the difference between the estimated values for these samples?

Re: Oryx ALS running with Hadoop

srowen — Fri, 12 Jun 2015 09:54:53 GMT

Yes, because the sampling rule is deterministic, the same IDs are sampled each time.

I'm fairly stumped by this one, as I can't make out why your user IDs would never get sampled. Clearly it has something to do with the modulus, since different, smaller values work. But it makes little sense unless your IDs' hash values weren't uniform, and they are hashed as strings.

Is it possible to compute the hash code of the string representation of all of your IDs and see how many are 0 mod 3673? At least that would rule some basic things in or out.
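
The check suggested above could be run offline with something like the following minimal sketch. It applies the same test as ConvergenceSampleFn (string hashCode modulo the sampling modulus) to a list of ID strings. The class and method names are illustrative, and the sample input is the handful of ID strings that appeared in the log output earlier in the thread; the real ID list would be substituted for it.

```java
import java.util.Arrays;
import java.util.List;

final class HashModCheck {

  // Remainder used by the sampling test: hash of the string form, mod modulus.
  // Java's % may return a negative remainder, which does not affect "== 0".
  static int remainder(String id, int modulus) {
    return id.hashCode() % modulus;
  }

  // Counts the IDs whose string hash is 0 mod modulus.
  static long countSampled(List<String> ids, int modulus) {
    return ids.stream().filter(id -> remainder(id, modulus) == 0).count();
  }

  public static void main(String[] args) {
    // ID strings taken from the log lines posted in this thread.
    List<String> ids = Arrays.asList(
        "2111185186541130611",
        "3174317317673160368",
        "3174428972624599832",
        "3444764202548713566",
        "3653094543606133455");
    for (String id : ids) {
      System.out.println(id + " hash=" + id.hashCode() + " rem=" + remainder(id, 3673));
    }
    System.out.println("sampled=" + countSampled(ids, 3673));
  }
}
```

For the first logged ID, the string hash is 977794330 and 977794330 % 3673 = 1327, so that ID would not be sampled. Run over the full ID list, the final count divided by the number of IDs should come out near 1/3673 if the hashes are roughly uniform.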

Re: Oryx ALS running with Hadoop

Jason.Chen — Fri, 12 Jun 2015 15:37:56 GMT

Sean,

Yes, we tried that..

We took the long IDs of the 7.5 million users (yes, the long ID is the one that Oryx generates by hashing), and about 2021 of them are 0 mod 3673. So that looks right. It's odd that they are not passing in Oryx. We have about 1200 items, and taking the long IDs mod 3673 gives us nothing (no item long ID is 0 mod 3673).
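
As a rough sanity check on those counts, here is a minimal sketch of how many IDs one would expect to pass a "0 mod 3673" test. It assumes the hashed IDs are uniformly distributed, which the thread has not established, and the names are illustrative.

```java
final class ExpectedSamples {

  // Expected number of IDs passing "hash % modulus == 0" under uniform hashing.
  static double expectedSampled(long idCount, int modulus) {
    return (double) idCount / modulus;
  }

  // Probability that none of idCount IDs passes, under the same assumption.
  static double probabilityNoneSampled(long idCount, int modulus) {
    return Math.pow(1.0 - 1.0 / modulus, idCount);
  }

  public static void main(String[] args) {
    int modulus = 3673;
    System.out.println(expectedSampled(7_500_000L, modulus)); // users
    System.out.println(expectedSampled(1_200L, modulus));     // items
    System.out.println(probabilityNoneSampled(1_200L, modulus));
  }
}
```

Under that assumption, about 2042 of 7.5 million users would be expected to pass, in line with the 2021 observed. For 1200 items the expectation is only about 0.33, and the chance that no item passes at all is roughly 0.72, so an empty item sample is the more likely outcome at this modulus.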

Some questions to follow.

(1) The sampling process is separate for user IDs and item IDs. Right?

(2) In my previous example, I used iterations 3 and 4. On second thought, I think the sampling should happen BEFORE iteration 1 starts. Right? I noticed there are several "data pre-processing" steps (e.g., MergeIDMappingStep). I am thinking the sampling happens there (MergeIDMappingStep) and then the same sample IDs are used in each iteration. So I am confused that the "hashcode log message" I provided appears in each reducer of each iteration. Can you explain a little bit?

Thanks.

Re: Oryx ALS running with Hadoop

srowen — Fri, 12 Jun 2015 16:01:44 GMT

Sampling happens on every iteration. It has to record the current estimates for the same sampled users/items, and those change on each iteration. From the second iteration on, it's possible to compare the current vs previous sample estimates to assess convergence. Yes, the sampling function is the same for both users and items; it's all in that function above.

The next thing I'd check is the statistics from the MapReduce job that runs ConvergenceSampleFn. How many records went into the reducer and how many came out? I assume 0 were emitted, but I'm wondering if somehow it's running on just a small subset of the data. That would at least explain it, but I don't know if that's the case. You should see about 7.5M records into the reducer, I believe.
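
The comparison of current versus previous sample estimates described above can be sketched roughly as follows. This is a simplified illustration and not the Oryx implementation: the map keyed by a "userID,itemID" string and all names are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

final class ConvergenceCheck {

  // Average absolute difference between the estimates of two iterations,
  // taken over the sampled pairs present in both. Empty if there are none.
  static OptionalDouble avgAbsDifference(Map<String, Float> previous,
                                         Map<String, Float> current) {
    double total = 0.0;
    int count = 0;
    for (Map.Entry<String, Float> e : current.entrySet()) {
      Float before = previous.get(e.getKey());
      if (before != null) {
        total += Math.abs(e.getValue() - before);
        count++;
      }
    }
    return count == 0 ? OptionalDouble.empty() : OptionalDouble.of(total / count);
  }

  public static void main(String[] args) {
    Map<String, Float> previous = new HashMap<>();
    Map<String, Float> current = new HashMap<>();
    previous.put("u1,i1", 1.0f);
    previous.put("u2,i1", 2.0f);
    current.put("u1,i1", 1.5f);
    current.put("u2,i1", 1.0f);
    System.out.println(avgAbsDifference(previous, current)); // average of 0.5 and 1.0
    System.out.println(avgAbsDifference(new HashMap<>(), new HashMap<>()));
  }
}
```

When the sample is empty there is nothing to average, so the caller needs some fallback. That corresponds to the "No samples for convergence; using artificial convergence value" message in the logs; how Oryx chooses that value is not shown in this thread.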

Re: Oryx ALS running with Hadoop

Jason.Chen — Fri, 12 Jun 2015 16:44:29 GMT

Sean,

Can you explain where I can find that information?

I checked the status of one particular job (a Y job named "....0-3-Y-RowStep...") in the Hadoop UI. This is a job that uses 30 reducers and failed to sample.

In the "Map-Reduce Framework" counter information, there are:

(1) combine input records: all zeros in our case

(2) combine output records: all zeros in our case

(3) Map input records: 1190

(4) Map output records: 1190

(5) Reduce input records: 1190

(6) Reduce output records: 0

(A) Where else should I check?

(B) I noticed that "Reduce output records=0", which does not look normal.

However, I also checked the job that uses 10 reducers and samples fine. It also has "Reduce output records=0". Any thoughts?

Thanks.