Oryx ALS running with Hadoop

Explorer

Sean,

 

We are running Oryx with Hadoop.

It converges around iteration 13.

However, the same dataset with the same training parameters takes about 120-130 iterations to converge on a single local VM

(one that is not running Hadoop).

 

This does not seem to make sense. I would think the iteration count does not depend on the platform (Hadoop vs. computation on a single local VM).

The iteration count should depend only on the training parameters, the convergence threshold, and the initial value of Y.

In other words, I would expect to see a similar iteration count on Hadoop and on a single local VM.

 

When running on Hadoop, I noticed the log message below. It looks like convergence happens after so few iterations because there are no samples, so an

"artificial convergence" value is used. I did not see a similar message on the single local VM (there it shows something like "Avg absolute difference in estimate vs prior iteration over 18163 samples: 0.20480296387324523"). So I think this may be the issue.

Any suggestions or thoughts on why this happens?

 

Tue Jun 09 22:14:38 PDT 2015 INFO No samples for convergence; using artificial convergence value: 6.103515625E-5
Tue Jun 09 22:14:38 PDT 2015 INFO Converged
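
To make concrete what I think is happening, here is a rough sketch of the pattern (made-up names, not Oryx's actual code): when there are no sampled estimates to compare against the prior iteration, a tiny fixed value is substituted, and the check passes right away.

class ConvergenceSketch {
  // Substitute value when nothing was sampled -- matches the value in the log above.
  static final double ARTIFICIAL_VALUE = 6.103515625E-5;

  static boolean hasConverged(java.util.List<Double> sampledAbsDiffs, double threshold) {
    double score;
    if (sampledAbsDiffs.isEmpty()) {
      // "No samples for convergence; using artificial convergence value"
      score = ARTIFICIAL_VALUE;
    } else {
      // "Avg absolute difference in estimate vs prior iteration over N samples"
      double sum = 0.0;
      for (double d : sampledAbsDiffs) {
        sum += d;
      }
      score = sum / sampledAbsDiffs.size();
    }
    // With zero samples, the artificial value is already below a typical threshold,
    // so the loop reports convergence after very few iterations.
    return score < threshold;
  }
}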

 

Thanks.

 

Jason


27 REPLIES

Master Collaborator

Does this have ConvergenceSampleFn in the name? That's the bit of interest. If that's what you're looking at, then it indicates that only 1190 users are in the input. Yes, we already know there are 0 output records, and yes, that is the problem.

 

So now the question to me is: why is that happening? Stepping back, how much input is really going into the first MapReduce jobs? Is it really consistent with the data set size you expect, 7.5M users? That's orders of magnitude different. You could browse the MR jobs to walk back and find where the size of the data starts to diverge from what's normal; that might help narrow down what's happening.

Explorer

Oh, 1190 is the item count (the Y job).

The 7.3 million users seem to be the input to the X job.

 

Yes, we are tracing the input size to see what happened...

 

Meanwhile, two side questions:

(1) Comparing the single local VM and Hadoop, is the same hash function used to hash IDs (user IDs and item IDs) to long IDs? (See the sketch below for the kind of deterministic mapping I mean.)

(2) Before ConvergenceSampleFn is called, is there any other pre-processing that could cut the IDs down? I traced some of the code but could not identify it.
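
To make question (1) concrete, here is a purely hypothetical example of what I mean by a deterministic string-to-long mapping. It is not necessarily the function Oryx uses, just an illustration that the same input should produce the same long ID whether it runs locally or inside a Hadoop task.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical example only -- not necessarily Oryx's mapping.
// A deterministic function like this yields the same long ID for the same
// string ID on a local VM and on Hadoop.
final class IdHashingSketch {
  static long toLongId(String id) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(id.getBytes(StandardCharsets.UTF_8));
    long value = 0L;
    for (int i = 0; i < 8; i++) {
      value = (value << 8) | (digest[i] & 0xFF); // take the first 8 digest bytes as a long
    }
    return value;
  }
}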

 

Thanks

Master Collaborator

Ah OK, I think I understand this now. I made two little mistakes here. The first was overlooking that you actually have a small number of items -- about a thousand, right? -- which matches the number of records going into the sampling function. And on re-reading the code, I see that the job is invoked over _items_, so really the loop is over items and then users, despite the names in the code. That is why there is so little input to this job: they're item IDs.

 

So, choosing the sampling rate based on the number of reducers is a little problematic, but reasonable. However, the number of reducers you have may be suitable for the number of users, but not for the number of items, which may be very different. This is a bit of a deeper suboptimality, since in your case the jobs have very different numbers of users and items. Normally it just means the item jobs in the iterations have more reducers than necessary, which is only a little extra overhead. But here it has also manifested as an actual problem for the way this convergence heuristic works.
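
To illustrate the failure mode with made-up numbers and an assumed formula (not the one actually in the code): when the rate is tied to a reducer count sized for millions of users, the expected number of item samples drops to essentially zero.

public class SamplingRateIllustration {
  public static void main(String[] args) {
    // Made-up numbers, only to show the shape of the problem.
    int numReducers = 200;                              // hypothetical reducer count sized for the user jobs
    double samplingRate = 1.0 / (numReducers * 100.0);  // assumed formula: rate shrinks as reducers grow
    long numItems = 1_190L;                             // roughly the item count in this thread
    long numUsers = 7_300_000L;                         // roughly the user count in this thread
    System.out.println("expected item samples: " + numItems * samplingRate); // ~0.06 -> essentially none
    System.out.println("expected user samples: " + numUsers * samplingRate); // ~365 -> plenty
  }
}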

 

One option is to let the user override the sampling rate, but it seems like something the user shouldn't have to set.

Another option is to expose control over the number of reducers for the user- and item-related jobs separately. That might be a good idea for the reasons above, although it's a slightly different issue.

 

More directly, I'm going to look at ways to efficiently count the number of users and items and choose a sampling rate accordingly. If the rate is too low, nothing is sampled; if it is too high, the sampling takes a long time. I had hoped to avoid another job just to do this counting, but maybe there is an efficient way to figure it out.
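
Roughly, the direction would be something like the following sketch (not the final change, and the names are placeholders): derive the rate from the actual ID count so the expected number of samples stays near a fixed target.

class SamplingRateChooser {
  // Sketch of the direction only -- not the actual change in the Issue112 branch.
  static double chooseSamplingRate(long idCount, long targetSamples) {
    if (idCount <= targetSamples) {
      return 1.0;                             // small sets (like ~1190 items): sample everything
    }
    return (double) targetSamples / idCount;  // large sets (like ~7M users): cap the expected sample count
  }
}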

 

Let me do some homework.

Master Collaborator

I have a new branch with a better approach: https://github.com/cloudera/oryx/tree/Issue112 Are you able to build and try this branch? I can send you a binary too.

Explorer

Sean,

 

Thanks so much.

Yes, we would like to give it a try!

 

Since I am travelling, I will pass this on to my co-workers (Ying Lu or Jinsu Oh) to try out.

 

Thanks a lot.

 

Jason

Master Collaborator

Sure guys, let me know if it seems to work. Once this is resolved I am going to cut a 1.1.0 release.

New Contributor

Hi Sean,

 

I tried it, but I am getting the following error:

 

Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by /tmp/snappy-1.0.5-libsnappyjava.so)

at java.lang.ClassLoader$NativeLibrary.load(Native Method)

at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1929)

at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1814)

at java.lang.Runtime.load0(Runtime.java:809)

at java.lang.System.load(System.java:1083)

at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)

... 32 more

 

Do you have any ideas on how to fix this error?

 

Thanks!

 

Ying

Master Collaborator

I think this is an issue with the installation of the native Snappy libs in your environment. The native Snappy code isn't finding the right libstdc++ on your system. You'll either need to address that, or remove Snappy.
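
If it helps, a small standalone check like this (just a diagnostic sketch, not a fix) forces snappy-java to load its native library outside of Hadoop, so you can reproduce the GLIBCXX error and verify a libstdc++ upgrade in isolation:

import org.xerial.snappy.Snappy;

public class SnappyNativeCheck {
  public static void main(String[] args) throws Exception {
    // Forces snappy-java to extract and load its native library (the step that
    // fails with the GLIBCXX error above), independent of any Hadoop job.
    byte[] compressed = Snappy.compress("snappy native load test".getBytes("UTF-8"));
    System.out.println("Native Snappy loaded OK; compressed to " + compressed.length + " bytes");
  }
}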