<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Oryx ALS running with Hadoop in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28418#M6228</link>
    <description>&lt;P&gt;Hi Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I noticed that you have a new commit (&lt;A target="_blank" href="https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb"&gt;https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;It looks like it addresses the missing-hash issue in the sampling.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Just to let you know, I gave it a try (I downloaded the code and compiled it with Java).&lt;/P&gt;&lt;P&gt;It still seems to have the same issue...&lt;/P&gt;&lt;P&gt;"....No samples for convergence; using artificial convergence value: 0.001953125....".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I use 30 reducers, and I noticed (from the Oryx code base) that the modulus is derived from the number of reducers.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
    <pubDate>Wed, 10 Jun 2015 22:16:37 GMT</pubDate>
    <dc:creator>Jason.Chen</dc:creator>
    <dc:date>2015-06-10T22:16:37Z</dc:date>
    <item>
      <title>Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28367#M6221</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We are running Oryx with Hadoop.&lt;/P&gt;&lt;P&gt;It converges around iteration 13.&lt;/P&gt;&lt;P&gt;However, the same dataset with the same training parameters takes about 120-130 iterations to converge in a single local VM (that is, not running with Hadoop).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This does not seem to make sense. I would think the number of iterations does not depend on the platform (Hadoop or local single-VM computation).&lt;/P&gt;&lt;P&gt;The number of iterations is related to the training parameters, the threshold and the initial value of Y.&lt;/P&gt;&lt;P&gt;In other words, I expect to see a similar number of iterations from Hadoop and from a single local VM.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When running in Hadoop, I noticed the following log message. It looks like convergence is declared after so few iterations because there are no samples, and it uses "artificial convergence". I did not see a similar message in the single local VM (it shows something like "Avg absolute difference in estimate vs prior iteration over 18163 samples: 0.20480296387324523"). So I think this may be the issue.&lt;/P&gt;&lt;P&gt;Any suggestions or thoughts on why this happens?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Tue Jun 09 22:14:38 PDT 2015 INFO No samples for convergence; using artificial convergence value: 6.103515625E-5&lt;BR /&gt;Tue Jun 09 22:14:38 PDT 2015 INFO Converged&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:31:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28367#M6221</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2022-09-16T09:31:14Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28375#M6222</link>
      <description>&lt;P&gt;We are talking about version 1.x here?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, while you wouldn't expect identical output from any two runs, and there are some computational differences in local vs Hadoop, I would not expect such a large difference.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You are correct that the problem is that it couldn't pick&amp;nbsp;any data for testing convergence. Is it writing "Yconvergence" temp directories with data? How many reducers do you have? I think the heuristic would fall down if you had a lot of reducers and very little data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Do you see messages like "Sampling for convergence where user/item ID == 0 % ..."?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 09:24:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28375#M6222</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-10T09:24:38Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28389#M6223</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;(1) Yes, Oryx 1.x (more precisely, Oryx 1.0.1).&lt;/P&gt;&lt;P&gt;(2) I&amp;nbsp;checked the "Yconvergence" temp directories. For example, when job "...&lt;SPAN&gt;0-8-Y-RowStep...&lt;/SPAN&gt;" is running, I see "...00000/tmp/iterations/7/Yconvergence" with only one file, "_SUCCESS", inside. There is no "...00000/tmp/iterations/8/Yconvergence".&lt;/P&gt;&lt;P&gt;(3) I use 30 reducers and the test data is about 3.5 GB (~7.x million users, ~ one thousand items; ~51 million events).&lt;BR /&gt;Hmm, it's interesting that you said "...I think the heuristic would fall down if you had a lot of reducers and very little data"...&lt;BR /&gt;Do you mean that when the data is small, I should reduce the number of reducers? Is it because too many reducers partition the "small" data into even smaller groups per reducer, and that affects convergence? Can you explain in more detail, so that I can share and discuss it with my co-workers?&lt;/P&gt;&lt;P&gt;(4) How can I avoid this convergence issue? Just decrease the number of reducers? Any suggestion on a "reasonable" setting based on the data size?&lt;BR /&gt;The training data will grow, and we want to know how to adjust the number of reducers dynamically based on the data size, so that we get good performance when running big data in a big cluster and still avoid the convergence issue...&lt;BR /&gt;In general, in a big cluster, we want to allocate more reducers to use the power of the cluster.&lt;/P&gt;&lt;P&gt;(5) Related to (4): in our case, the number of users and events will grow significantly, but the number of items will not; it will probably stay at about 1200-1500.&lt;BR /&gt;I am thinking of using more reducers in our bigger cluster to handle the bigger data set. Given that our item count stays small (although the user and event counts become big), will it still have the same convergence issue? This is my main concern.&lt;/P&gt;&lt;P&gt;(6) I use 10, 20, 30 when I adjust the number of reducers. Should I use a prime number instead? Will that help with the convergence issue?&lt;/P&gt;&lt;P&gt;(7) I did not see "Sampling for convergence where user/item ID == 0 % ...", but I saw the following log message in almost every iteration.&lt;BR /&gt;The numbers (3673 and 7.412388E-6%) in the log message are the same in each iteration... odd...&lt;BR /&gt;"Log: Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence"&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 15:05:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28389#M6223</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-10T15:05:29Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28390#M6224</link>
      <description>&lt;P&gt;Yes, in general you may wish to use fewer reducers if your data is small and more if it's large, though it's more of a tuning issue than necessary to make it work. In general this problem has nothing to do with the number of reducers; I was guessing at a corner case and it's not relevant here. There isn't a special setting to know about like making it a prime, no. What I do think is happening is that the simple sampling rule isn't quite right, since it will depend to some degree on the distribution of your IDs, and there's not a good reason to expect an even distribution. Specifically, I suspect none of your IDs are 0 mod 3673. I think there needs to be an extra hash in here. By the way are you using Java 7? You don't have to, just checking.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 15:15:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28390#M6224</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-10T15:15:36Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28391#M6225</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;(1) I tried both Java 7 and Java 8. The convergence issue is the same with both.&lt;/P&gt;&lt;P&gt;(2) Can you explain a little bit about "...none of your IDs are 0 mod 3673.."? Which IDs? User IDs, item IDs or both?&lt;/P&gt;&lt;P&gt;(3) Why is there no such problem when running in a single VM? Is the convergence sampling rule different from the Hadoop version of ALS?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks again.&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jun 2015 08:19:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28391#M6225</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-11T08:19:33Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28393#M6226</link>
      <description>&lt;P&gt;Yes, your IDs. Often they are internally hashed anyway, but if your IDs are already numeric, they are not hashed. But there's no good reason to expect they are evenly distributed. So the simple deterministic sample here doesn't work (sample 1/n of data by taking anything whose value is 0 mod n), because it fails to sample anything. An extra hash in here should fix that. In one VM there is no need to do this sampling since all data is available easily in memory. This mechanism is an efficient equivalent for data-parallel Hadoop-based computation. Java 7 vs 8 doesn't matter. I was asking because I was about to release 1.1.0 and can add my fix, but it requires Java 7, so was figuring out whether that would work for you. Convergence is usually 20-40 iterations at most. But you should not need to set a fixed value. Would you be able to test a new build from source?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 15:39:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28393#M6226</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-10T15:39:25Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28394#M6227</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Got it. Thanks.&lt;/P&gt;&lt;P&gt;Good to know that you plan to fix this in the Oryx 1.1.0 release.&lt;/P&gt;&lt;P&gt;Do you have an idea of the timeline?&lt;/P&gt;&lt;P&gt;I was able to build from your source (1.0.1) using Java 8, and I do not think there would be an issue building 1.1.0 from source.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 15:48:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28394#M6227</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-10T15:48:25Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28418#M6228</link>
      <description>&lt;P&gt;Hi Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I noticed that you have a new commit (&lt;A target="_blank" href="https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb"&gt;https://github.com/cloudera/oryx/commit/bb8fddd052abcd89af13feef74bc5d1d5aeaf8cb&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;It looks like it addresses the missing-hash issue in the sampling.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Just to let you know, I gave it a try (I downloaded the code and compiled it with Java).&lt;/P&gt;&lt;P&gt;It still seems to have the same issue...&lt;/P&gt;&lt;P&gt;"....No samples for convergence; using artificial convergence value: 0.001953125....".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I use 30 reducers, and I noticed (from the Oryx code base) that the modulus is derived from the number of reducers.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 22:16:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28418#M6228</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-10T22:16:37Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28419#M6229</link>
      <description>&lt;P&gt;Hm, do you see a message like "Using convergence sampling modulus ... "?&lt;/P&gt;&lt;P&gt;What are your IDs like? Literally, can you show a few examples?&lt;/P&gt;&lt;P&gt;That was a good guess but it may not be the issue.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2015 22:26:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28419#M6229</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-10T22:26:13Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28422#M6230</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, I saw this message for each iteration, something like:&lt;/P&gt;&lt;P&gt;Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I cannot share the exact IDs, but here is the format:&lt;/P&gt;&lt;P&gt;User ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX&amp;nbsp; (X is either a letter or a digit)&lt;/P&gt;&lt;P&gt;Item ID: XXXXX_xxxxxxxx (X is either an upper-case letter or a digit; x is either a lower-case letter or a digit).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jun 2015 00:14:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28422#M6230</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-11T00:14:51Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28425#M6231</link>
      <description>&lt;P&gt;OK, then it was a reasonable fix but it actually would not have affected you anyway given that your IDs are strings.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can't see why it wouldn't sample any of the IDs. Their string hashCode ought to be fairly well distributed, so you should get reasonably close to the desired fraction of IDs sampled. You see the "Yconvergence" dir, so the right jobs are running, but there's no output (just _SUCCESS), which suggests that everything is working except not outputting IDs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'd like to know what happens on these IDs inside ConvergenceSampleFn, but I know you can't share the IDs. I wonder if it's possible to run just that snippet of code on a bunch of IDs to understand what they hash to? or to toss in a few logging statements and re-run on your end to see what happens?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;@Override&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;public void &lt;/SPAN&gt;process(Pair&amp;lt;Long, &lt;SPAN&gt;float&lt;/SPAN&gt;[]&amp;gt; input, Emitter&amp;lt;String&amp;gt; emitter) {&lt;BR /&gt;  String userIDString = input.first().toString();&lt;BR /&gt;  &lt;SPAN&gt;if &lt;/SPAN&gt;(userIDString.hashCode() % &lt;SPAN&gt;convergenceSamplingModulus &lt;/SPAN&gt;== &lt;SPAN&gt;0&lt;/SPAN&gt;) {&lt;BR /&gt;    &lt;SPAN&gt;float&lt;/SPAN&gt;[] xu = input.second();&lt;BR /&gt;    &lt;SPAN&gt;for &lt;/SPAN&gt;(LongObjectMap.MapEntry&amp;lt;&lt;SPAN&gt;float&lt;/SPAN&gt;[]&amp;gt; entry : &lt;SPAN&gt;yState&lt;/SPAN&gt;.getY().entrySet()) {&lt;BR /&gt;      &lt;SPAN&gt;long &lt;/SPAN&gt;itemID = entry.getKey();&lt;BR /&gt;      &lt;SPAN&gt;if &lt;/SPAN&gt;(Long.&lt;SPAN&gt;toString&lt;/SPAN&gt;(itemID).hashCode() % &lt;SPAN&gt;convergenceSamplingModulus &lt;/SPAN&gt;== &lt;SPAN&gt;0&lt;/SPAN&gt;) {&lt;BR /&gt;        &lt;SPAN&gt;float &lt;/SPAN&gt;estimate = (&lt;SPAN&gt;float&lt;/SPAN&gt;) 
SimpleVectorMath.&lt;SPAN&gt;dot&lt;/SPAN&gt;(xu, entry.getValue());&lt;BR /&gt;        emitter.emit(DelimitedDataUtils.&lt;SPAN&gt;encode&lt;/SPAN&gt;(&lt;SPAN&gt;','&lt;/SPAN&gt;, userIDString, itemID, estimate));&lt;BR /&gt;      }&lt;BR /&gt;    }&lt;BR /&gt;  }&lt;BR /&gt;}&lt;/PRE&gt;</description>
      <pubDate>Thu, 11 Jun 2015 05:14:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28425#M6231</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-11T05:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28427#M6232</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for the follow-up.&lt;/P&gt;&lt;P&gt;Yes, I can try that. Can you insert the appropriate log.info calls into the code you want me to try, so that it logs the right information for you to review?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Meanwhile, I tried reducing the number of reducers (from 30 to 10), and I noticed it did sample and compute the convergence distance. I checked the code and it looks like the number of reducers is used to generate the modulus.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;P&gt;Avg absolute difference in estimate vs prior iteration over 2124 samples: 0.02002799961913492&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Jason&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jun 2015 05:30:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28427#M6232</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-11T05:30:05Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28428#M6233</link>
      <description>&lt;P&gt;That's good that it works at a different value, but I can't figure out why that would be. Obviously it has something to do with the IDs. The two extra log statements in ConvergenceSampleFn will print all of their hash codes:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;@Override&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;public void &lt;/SPAN&gt;process(Pair&amp;lt;Long, &lt;SPAN&gt;float&lt;/SPAN&gt;[]&amp;gt; input, Emitter&amp;lt;String&amp;gt; emitter) {&lt;BR /&gt;  String userIDString = input.first().toString();&lt;BR /&gt;  &lt;SPAN&gt;log&lt;/SPAN&gt;.info(Integer.&lt;SPAN&gt;toString&lt;/SPAN&gt;(userIDString.hashCode()));&lt;BR /&gt;  &lt;SPAN&gt;if &lt;/SPAN&gt;(userIDString.hashCode() % &lt;SPAN&gt;convergenceSamplingModulus &lt;/SPAN&gt;== &lt;SPAN&gt;0&lt;/SPAN&gt;) {&lt;BR /&gt;    &lt;SPAN&gt;float&lt;/SPAN&gt;[] xu = input.second();&lt;BR /&gt;    &lt;SPAN&gt;for &lt;/SPAN&gt;(LongObjectMap.MapEntry&amp;lt;&lt;SPAN&gt;float&lt;/SPAN&gt;[]&amp;gt; entry : &lt;SPAN&gt;yState&lt;/SPAN&gt;.getY().entrySet()) {&lt;BR /&gt;      &lt;SPAN&gt;long &lt;/SPAN&gt;itemID = entry.getKey();&lt;BR /&gt;      &lt;SPAN&gt;log&lt;/SPAN&gt;.info(Integer.&lt;SPAN&gt;toString&lt;/SPAN&gt;(Long.&lt;SPAN&gt;toString&lt;/SPAN&gt;(itemID).hashCode()));&lt;BR /&gt;      &lt;SPAN&gt;if &lt;/SPAN&gt;(Long.&lt;SPAN&gt;toString&lt;/SPAN&gt;(itemID).hashCode() % &lt;SPAN&gt;convergenceSamplingModulus &lt;/SPAN&gt;== &lt;SPAN&gt;0&lt;/SPAN&gt;) {&lt;BR /&gt;        &lt;SPAN&gt;float &lt;/SPAN&gt;estimate = (&lt;SPAN&gt;float&lt;/SPAN&gt;) SimpleVectorMath.&lt;SPAN&gt;dot&lt;/SPAN&gt;(xu, entry.getValue());&lt;BR /&gt;        emitter.emit(DelimitedDataUtils.&lt;SPAN&gt;encode&lt;/SPAN&gt;(&lt;SPAN&gt;','&lt;/SPAN&gt;, userIDString, itemID, estimate));&lt;BR /&gt;      }&lt;BR /&gt;    }&lt;BR /&gt;  }&lt;BR /&gt;}&lt;/PRE&gt;</description>
      <pubDate>Thu, 11 Jun 2015 05:55:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28428#M6233</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-11T05:55:24Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28480#M6234</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(1) Here are some results:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 2111185186541130611 hashCode= 977794330&lt;BR /&gt;INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174317317673160368 hashCode= 463078209&lt;BR /&gt;INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174428972624599832 hashCode= 1617905253&lt;BR /&gt;INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3444764202548713566 hashCode= 1628781813&lt;BR /&gt;INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3653094543606133455 hashCode= 1773010709&lt;/P&gt;&lt;P&gt;(2) In the Hadoop log, I do not see any output from the following statement. Based on this, it seems that no user ID passes the "if (userIDString.hashCode() % convergenceSamplingModulus == 0)" check...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN&gt;log&lt;/SPAN&gt;.info(Integer.&lt;SPAN&gt;toString&lt;/SPAN&gt;(Long.&lt;SPAN&gt;toString&lt;/SPAN&gt;(itemID).hashCode()));&lt;/PRE&gt;&lt;P&gt;(3) Can you explain overall how the sampling works?&lt;/P&gt;&lt;P&gt;(a) Does it sample in each reducer of each iteration?&lt;/P&gt;&lt;P&gt;(b) When it samples, does it loop over all Long user IDs and Long item IDs and then apply the mod? I saw you use hashCode in the new code. Oryx 1.0.1 uses the Long IDs for the mod...&lt;/P&gt;&lt;P&gt;(c) int modulus = RandomUtils.nextTwinPrime(4 * opts.getNumReducers() * opts.getNumReducers());&lt;/P&gt;&lt;P&gt;Why did you choose the modulus this way?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 06:10:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28480#M6234</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-12T06:10:03Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28481#M6235</link>
      <description>&lt;P&gt;Yes, that's strange, since we should see about 1 in 3673 IDs pass this check. Here's a quick demo of the same idea from some Scala one-liners:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;val r = new scala.util.Random&lt;BR /&gt;(0 to 10000000).par.count(x =&amp;gt; r.nextLong.toString.hashCode % 3673 == 0)&lt;BR /&gt;&lt;BR /&gt;2854&lt;BR /&gt;&lt;BR /&gt;10000000/2854&lt;BR /&gt;&lt;BR /&gt;3503&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So about 1 in 3503 IDs pass, close to the expected 1 in 3673. The idea ought to be sound. How much input do you have -- how many user IDs? It's a reasonably large number, right?&lt;/P&gt;&lt;P&gt;Sampling simply relies on the uniformity of the distribution of the hash code, which is fine.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, the problem was that IDs are not uniform sometimes, but the hashCode should always fix that.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, sampling is per iteration and samples the same IDs each time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The sampling size is chosen to try to scale up with the input size but it doesn't know the input size, so it's proxied by the number of reducers. This is an empirically determined formula.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 06:21:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28481#M6235</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-12T06:21:54Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28482#M6236</link>
      <description>&lt;P&gt;Hm... it's strange that no IDs passed.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have 7.6 million user IDs...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;A question on this: "...&lt;SPAN class="s1"&gt;Yes, sampling is per iteration and samples the same IDs each time.&lt;/SPAN&gt;..."&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As an example, say there are 30 reducers and we are in&amp;nbsp;iteration 3:&lt;/P&gt;&lt;P&gt;(1) In iteration 3, reducer #1 loops over all the user IDs (and item IDs) inside reducer #1.&lt;/P&gt;&lt;P&gt;(2) In iteration 3, reducer #2 loops over all the user IDs (and item IDs) inside reducer #2.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then, after&amp;nbsp;iteration 3, it saves the sampled IDs. The same sampled IDs are then used in&amp;nbsp;iteration 4, and it compares the&amp;nbsp;difference between the estimated values for these samples?&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 07:01:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28482#M6236</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-12T07:01:53Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28485#M6237</link>
      <description>&lt;P&gt;Yes, because the sampling rule is deterministic, the same IDs are sampled each time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm fairly stumped by this one, as I can't make out why your user IDs would never get sampled. Clearly it's something to do with the modulus since different smaller values work. But it makes little sense unless your IDs' hash values weren't uniform, but they are hashed as strings.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is it possible to compute the hash code of the string representation of all of your IDs and see how many are 0 mod 3673? At least that would rule in or out some basic things.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 09:54:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28485#M6237</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-12T09:54:53Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28495#M6238</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, we tried that.&lt;/P&gt;&lt;P&gt;We took the long IDs of the 7.5 million users (yes, the long ID is the one that Oryx generates by hashing) and about 2021 of them are 0 mod 3673. So it looks right. It's odd that they are not passing in Oryx. We have about 1200 items, and the long ID mod 3673 gives us nothing (no item long ID is 0 mod 3673)...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Some follow-up questions:&lt;/P&gt;&lt;P&gt;(1) The sampling process is separate for user IDs and item IDs, right?&lt;/P&gt;&lt;P&gt;(2) In my previous example, I used iterations #3 and #4. On second thought, I am thinking the sampling should happen&amp;nbsp;BEFORE iteration 1 starts, right? I notice there are several "data pre-processing" steps (e.g., MergeIDMappingStep). I was thinking the sampling&amp;nbsp;happened there (MergeIDMappingStep) and then the same sample IDs were used across all iterations. So I am confused that the "hashcode log message" I provided appears in each reducer of each iteration. Can you explain a&amp;nbsp;little bit?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 15:37:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28495#M6238</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-12T15:37:56Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28497#M6239</link>
      <description>&lt;P&gt;Sampling happens on every iteration. It has to record the current estimates for the same sampled users/items, and those change on each iteration. On the second iteration it's possible to compare the current vs previous sample estimates to assess convergence. Yes the sampling function is the same for both users and items; it's all in that function above. The next thing I'd check are statistics from the MapReduce job that runs for ConvergenceSampleFn. How many records went into the reducer and came out? I assume 0 were emitted, but I'm wondering if somehow it's running on just a small set of the data. That would at least explain it but I don't know if that's the case. You should see about 7.5M records into the reducer, I believe.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 16:01:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28497#M6239</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-06-12T16:01:44Z</dc:date>
    </item>
    <item>
      <title>Re: Oryx ALS running with Hadoop</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28498#M6240</link>
      <description>&lt;P&gt;Sean,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you explain a little bit where I can find such info?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I checked the status of one particular job (a Y job named "....0-3-Y-RowStep...") in the Hadoop UI. This is a job that uses 30 reducers and failed to sample.&lt;/P&gt;&lt;P&gt;In the "Map-Reduce Framework" counters, I see:&lt;/P&gt;&lt;P&gt;(1) Combine input records: all zeros in our case&lt;/P&gt;&lt;P&gt;(2) Combine output records: all zeros in our case&lt;/P&gt;&lt;P&gt;(3) Map input records: 1190&lt;/P&gt;&lt;P&gt;(4) Map&amp;nbsp;output records: 1190&lt;/P&gt;&lt;P&gt;(5) Reduce input records: 1190&lt;/P&gt;&lt;P&gt;(6) Reduce output records: 0&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(A) Where else should I check?&lt;/P&gt;&lt;P&gt;(B) I noticed "Reduce output records=0", which does not look normal. However, I also checked the job that uses 10 reducers and samples fine. It also has "Reduce output records=0". Any thoughts?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 16:44:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oryx-ALS-running-with-Hadoop/m-p/28498#M6240</guid>
      <dc:creator>Jason.Chen</dc:creator>
      <dc:date>2015-06-12T16:44:29Z</dc:date>
    </item>
  </channel>
</rss>

