Member since: 07-18-2014 | Posts: 74 | Kudos Received: 0 | Solutions: 0
06-10-2015
03:16 PM
Hi Sean, I noticed that you have a new commit (https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb) that looks like it addresses the no-samples hash issue. Just to let you know, I gave it a try (I downloaded the code and compiled it with Java), and it still shows the same issue: "...No samples for convergence; using artificial convergence value: 0.001953125...". I use 30 reducers, and I do notice (from the Oryx code base) that the modulus is related to the number of reducers. Jason
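A minimal sketch of what modulus-based convergence sampling of this kind can look like, for readers following along. This is not the actual Oryx code: the class, method, and constant names are invented, and the rule that both IDs must hit the modulus is inferred further down in this thread from the logged sampling percentage.

    // Hypothetical sketch of modulus-based convergence sampling; NOT the real
    // Oryx implementation. Names are made up for illustration.
    public class ConvergenceSamplingSketch {

        // Value observed in the logs; assumed (per the observation above) to be
        // derived from the number of reducers.
        static final long CONVERGENCE_MODULUS = 3673L;

        // A user-item pair contributes to the convergence estimate only when its
        // hashed IDs land on 0 modulo the modulus. If no ID ever does, there are
        // no samples, and an artificial convergence value is substituted.
        static boolean sampledForConvergence(long userIDHash, long itemIDHash) {
            return userIDHash % CONVERGENCE_MODULUS == 0
                && itemIDHash % CONVERGENCE_MODULUS == 0;
        }
    }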
06-10-2015
08:48 AM
Sean, Got it, thanks. Good to know that you plan to fix this in the Oryx 1.1.0 release. Do you have an idea about the timeline? I was able to build from your source (1.0.1) using Java 8, and I do not think there would be any issue building 1.1.0 from source. Jason
06-10-2015
08:31 AM
Sean,
(1) I tried both Java 7 and Java 8; the convergence issue behaves the same way with both.
(2) Can you explain a little about "...none of your IDs are 0 mod 3673..."? Which IDs do you mean: user IDs, item IDs, or both?
(3) Why is there no such problem when running in a single VM? Is the convergence sampling rule different from the ALS version on Hadoop?
Thanks again. Jason
06-10-2015
08:05 AM
Thanks for your reply.
(1) Yes, Oryx 1.x (more precisely, Oryx 1.0.1).
(2) I checked the "Yconvergence" temp directories. For example, while the job "...0-8-Y-RowStep..." is running, I see "...00000/tmp/iterations/7/Yconvergence" containing only a single "_SUCCESS" file, and there is no "...00000/tmp/iterations/8/Yconvergence".
(3) I use 30 reducers, and the test data is about 3.5 GB (~7.x million users, ~one thousand items, ~51 million events). It's interesting that you said "...I think the heuristic would fall down if you had a lot of reducers and very little data...". Do you mean that when the data is small, I should reduce the number of reducers? Is it because too many reducers partition the "small" data into smaller groups per reducer, which affects convergence? Can you explain the details, so I can share and discuss them with my co-workers?
(4) How can I avoid this convergence issue? Just decrease the number of reducers? Any suggestion on a "reasonable" setting based on data size? The training data will grow, and we want to know how to adjust the reducer count dynamically with data size, so that we get good performance on big data in a big cluster while avoiding the convergence issue. In general, in a big cluster we want to allocate more reducers, to use the power of the cluster.
(5) Related to (4): in our case, the numbers of users and events will grow significantly, but not the items, which will stay at roughly 1,200-1,500. I am thinking of using more reducers in our bigger cluster to handle the bigger data set. Given that our item count stays small (even though users and events grow large), will we still hit the same convergence issue (because the item set stays small)? This is my main concern.
(6) I used 10, 20, and 30 when adjusting the reducer count. Should I use a prime number instead? Would that help with the convergence issue?
(7) I did not see "Sampling for convergence where user/item ID == 0 % ...", but I saw the following log message in almost every iteration, with the same numbers (3673 and 7.412388E-6%) each time, which seems odd: "Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence"
Thanks. Jason
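One observation that may help interpret item (7): 7.412388E-6% is exactly 100/3673^2, which suggests a user-item pair is sampled only when both the user ID and the item ID are 0 mod the modulus. With only about a thousand items, the expected number of item IDs hitting 0 mod 3673 is below one, so zero sampled pairs (and hence the artificial convergence value) is the most likely outcome; a larger reducer count presumably means a larger modulus and even fewer samples. This is inferred from the logged numbers, not confirmed against the Oryx source. A quick check:

    public class SamplingPercentageCheck {
        public static void main(String[] args) {
            long modulus = 3673L;
            // Fraction of user-item pairs sampled if BOTH IDs must be 0 mod the
            // modulus: (1/3673) * (1/3673), expressed as a percentage.
            double pctOfPairs = 100.0 / ((double) modulus * modulus);
            System.out.println(pctOfPairs);        // ~7.412388E-6, matching the log line

            // Expected number of item IDs that are 0 mod 3673 among ~1,000 items:
            System.out.println(1000.0 / modulus);  // ~0.27 -- most likely zero item hits
        }
    }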
06-09-2015
11:16 PM
Sean, We are running Oryx on Hadoop, and it converges around iteration 13. However, the same dataset with the same training parameters takes about 120-130 iterations to converge in a single local VM (i.e., not running on Hadoop). This does not seem to make sense. I would expect the iteration count not to depend on the platform (Hadoop vs. local single-VM computation); it should depend on the training parameters, the threshold, and the initial value of Y. In other words, I expect to see a similar iteration count on Hadoop and in a single local VM. When running on Hadoop, I noticed the log messages below. It looks like convergence is declared after few iterations because there are no samples and an "artificial convergence" value is used. I did not see a similar message in the single local VM (it shows something like "Avg absolute difference in estimate vs prior iteration over 18163 samples: 0.20480296387324523"). So I think this may be the issue. Any suggestion or thought on why this happens?
Tue Jun 09 22:14:38 PDT 2015 INFO No samples for convergence; using artificial convergence value: 6.103515625E-5
Tue Jun 09 22:14:38 PDT 2015 INFO Converged
Thanks. Jason
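Worth noting: both artificial convergence values quoted in this thread are exact negative powers of two, 6.103515625E-5 = 2^-14 (this run, converging around iteration 13) and 0.001953125 = 2^-9 (the 30-reducer run above, whose job names mention iteration 8). That is consistent with a fallback value of roughly 2^-(iteration + 1) that simply halves each iteration until it drops below the convergence threshold, which would explain "converging" in so few iterations without measuring anything real. This is a guess from the two logged values, not confirmed in the source:

    public class ArtificialConvergenceGuess {
        public static void main(String[] args) {
            // The two artificial values logged in this thread are exact powers of two:
            System.out.println(Math.pow(2, -14)); // 6.103515625E-5 (this run, ~iteration 13)
            System.out.println(Math.pow(2, -9));  // 0.001953125    (the 30-reducer run above)
            // If the fallback is roughly 2^-(iteration + 1), it sinks below any fixed
            // threshold after a handful of iterations -- "converging" without ever
            // measuring the actual change in the estimates.
        }
    }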
Labels: Apache Hadoop, Training
06-02-2015
08:22 AM
Sean, Two more questions from checking the Hadoop logs and the Oryx computation logs. We want to understand how the Oryx computation works with Hadoop.
(1) When it computes X or Y (with Hadoop), the Oryx logs show, for example, "number of splits:2" and "Total input paths to process : 11". Are these numbers determined automatically by Hadoop, or by Oryx? I checked the Oryx code and cannot find where they are set.
(2) Does the Oryx code control how many reducers run simultaneously on each node? For example, is "mapreduce.tasktracker.reduce.tasks.maximum" overridden?
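For context on (1) and (2) in generic MapReduce terms (independent of what Oryx itself does): the split count is computed by the job's InputFormat from the input files and block size, the reducer count is a per-job setting, and mapreduce.tasktracker.reduce.tasks.maximum is a per-node MR1 TaskTracker limit that jobs normally do not override (under YARN, per-node concurrency is governed by container resources instead). A sketch with the stock Hadoop API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "example");
            // "number of splits" is computed by the InputFormat from input size
            // and block size; it is not set directly by the job.
            // The reducer count, by contrast, IS a per-job setting:
            job.setNumReduceTasks(30); // same effect as mapreduce.job.reduces=30
            // How many of those tasks run concurrently per node is a cluster-level
            // setting (mapreduce.tasktracker.reduce.tasks.maximum on MR1), which a
            // job normally does not override.
        }
    }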
05-30-2015
10:13 AM
OK. Understood that 3-4 GB of data is too small to see the benefits of using Hadoop (due to the overhead). We are collecting data and it is growing fast; we will see whether Hadoop-based computation scales well with much larger data. Thanks.
05-29-2015
11:25 PM
Sean, Following up on some scenarios I posted before, but in a separate thread... I am using Oryx 1.0 with Hadoop (CDH 5.4.1). It ran slowly, and I tuned mapper-memory-mb and reducer-memory-mb, which did not help. Is it possible to tune the Oryx config to (1) set the number of map and reduce tasks appropriately and (2) use LZO compression for map output? Thanks.
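For reference, (2) is a generic MapReduce setting rather than anything Oryx-specific. Assuming the hadoop-lzo libraries are installed on every node (an assumption; they ship separately from core CDH), enabling LZO for intermediate map output looks roughly like this:

    import org.apache.hadoop.conf.Configuration;

    public class MapOutputLzoExample {
        public static Configuration withLzoMapOutput() {
            Configuration conf = new Configuration();
            // Compress intermediate (map) output -- generic MapReduce, not Oryx-specific:
            conf.setBoolean("mapreduce.map.output.compress", true);
            // LzoCodec comes from the separately installed hadoop-lzo package,
            // which must be present on every node in the cluster.
            conf.set("mapreduce.map.output.compress.codec",
                     "com.hadoop.compression.lzo.LzoCodec");
            return conf;
        }
    }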
05-28-2015
11:24 PM
Sean, We are experimenting with (a) a single Computation node and (b) a single Computation node plus a Hadoop cluster, and we want to see the difference in running time between (a) and (b). Questions:
(1) What do you mean by "There's no point in using Hadoop if you're just going to run on one machine"? Our data will grow fast, and then we cannot just use one VM (and keep increasing its memory). We think Hadoop MapReduce can help us scale as the data grows.
(2) Is tuning "mapper-memory-mb" and "reducer-memory-mb" potentially a way to speed up the process, since it allocates more memory?
Thanks. Jason
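On (2), for anyone mapping Oryx's mapper-memory-mb / reducer-memory-mb to the underlying Hadoop knobs: the standard per-task container and heap settings are shown below. That Oryx forwards its config to exactly these properties is an assumption here; the Hadoop property names themselves are standard. Also note that more memory mainly avoids spills and out-of-memory failures; it does not by itself make the computation faster.

    import org.apache.hadoop.conf.Configuration;

    public class TaskMemoryExample {
        public static Configuration withTaskMemory() {
            Configuration conf = new Configuration();
            // YARN container sizes for map and reduce tasks, in MB:
            conf.setInt("mapreduce.map.memory.mb", 2048);
            conf.setInt("mapreduce.reduce.memory.mb", 4096);
            // The JVM heap should stay below the container size (~80% is a
            // common rule of thumb), or the container will be killed:
            conf.set("mapreduce.map.java.opts", "-Xmx1638m");
            conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
            return conf;
        }
    }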
05-27-2015
10:34 PM
Sean, Yes, we tried GC logging and it helped identify the memory usage. We are also trying to measure the memory used by the Oryx computation, and to use Hadoop for the model computation.
(1) How can we compute the memory needed more precisely?
(2) We also ran Oryx on Hadoop and it runs very slowly. The good thing about Hadoop is that it avoids the out-of-memory errors, but we do want to address the slow computation. So my question is whether you have any suggestions for tuning the Hadoop settings in the Oryx config (say, mapper-memory-mb and reducer-memory-mb?).
(3) We heard that Oryx 2.0 uses Spark and has a built-in train/validation process. Would that help address the issues I mentioned in (2)?
Thanks for your time. Jason
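On (1), a crude but easy lower bound is heap occupancy after a full GC; GC logs report the same numbers, and the sketch below just reads them programmatically from inside the JVM being measured:

    public class HeapUsage {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            rt.gc(); // request a GC so "used" approximates live data, not garbage
            long usedMb = (rt.totalMemory() - rt.freeMemory()) >> 20;
            long maxMb  = rt.maxMemory() >> 20;
            System.out.printf("Used heap: %d MB of %d MB max%n", usedMb, maxMb);
        }
    }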