About Jason.Chen

Jason.Chen · ‎06-28-2015

The connection issue shows up in RowStep. One example of the detailed log is below (I slightly modified it to hide some sensitive server info, but the main messages are intact).

One odd thing: even though it reports that it cannot reach the server:port (say 10.190.36.114:40915), as shown below, the job still eventually completes. I am thinking maybe it completes with other nodes on a "standard" port? Still, seeing failed connections to the server is not a good sign, because the retries add unnecessary running time.

/// Logs ////
Thu May 28 07:27:57 PDT 2015 INFO Running job "Oryx-/user/xyz/int/def-1-122-Y-RowStep: Avro(hdfs://server105:8020/u... ID=1 (1/1)"
Thu May 28 07:27:57 PDT 2015 INFO Job status available at: http://server105:8088/proxy/application_1432750221048_0525/
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 2 time(s); maxRetries=3
...
Thu May 28 07:34:15 PDT 2015 INFO Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
Thu May 28 07:34:16 PDT 2015 INFO Finished Oryx-/user/xyz/int/def-1-122-Y-RowStep
Thu May 28 07:34:16 PDT 2015 INFO Completed RowStep in 379s

Jason.Chen · ‎06-28-2015

Sean, I have some follow-up questions regarding this topic ("Run Oryx on a machine that is not part of the cluster").

We started to test the case where the Oryx 1.0 computation/serving layers run on VMs that are in a different virtual LAN from the Hadoop cluster. There are firewall port issues for the communication between the two virtual LANs, so we opened all the ports used by Hadoop on the Hadoop cluster's virtual LAN so that the Oryx VMs can talk to it. We got the list of ports used by Hadoop from both the Hadoop configuration files and some online Cloudera CDH port info.

After doing that, yes, Oryx is able to submit jobs to the Hadoop cluster, at least to some extent. However, it still hits some communication issues. For example, in the Oryx log I see something like this:

Retrying connect to server: server-name/10.190.36.113:40651. Already tried 0 time(s); maxRetries=3
Retrying connect to server: server-name/10.190.36.113:40651. Already tried 1 time(s); maxRetries=3
....
Retrying connect to server: server-name/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Retrying connect to server: server-name/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3
....

I dug into the code and I do not understand why this could happen. My questions:

(1) Is the communication between Oryx and Hadoop bidirectional or unidirectional? My understanding is that Oryx uses the Hadoop configuration files to find out where (server and port) it should submit the jobs. After Oryx submits a job, how does Oryx know the job is completed? Does Oryx check with Hadoop to get the status, or does Hadoop communicate back to the Oryx VM with the status?

(2) Related to (1) and the log info I posted above: are "dynamic" ports used during the Oryx-Hadoop communication? In the log messages I see ports 40651 and 40915. They do not seem to be standard Hadoop ports, and the port numbers even change dynamically. This is confusing.

Thanks.
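To see at a glance which host:port pairs the retries are hitting, the "Retrying connect to server" lines can be reduced to a list of distinct endpoints. The following is only a minimal sketch: the class name RetryLogEndpoints and the regular expression are illustrative assumptions based on the log format quoted above, not part of Oryx or Hadoop.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Oryx or Hadoop.
final class RetryLogEndpoints {

    // Assumed format: "Retrying connect to server: <name>/<ipv4>:<port>."
    private static final Pattern RETRY = Pattern.compile(
        "Retrying connect to server: [^/\\s]+/(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d+)\\.");

    /** Returns the distinct ip:port endpoints found in the lines, in first-seen order. */
    static Set<String> endpoints(List<String> logLines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = RETRY.matcher(line);
            while (m.find()) {
                seen.add(m.group(1) + ":" + m.group(2));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Retrying connect to server: server-name/10.190.36.113:40651. Already tried 0 time(s); maxRetries=3",
            "Retrying connect to server: server-name/10.190.36.113:40651. Already tried 1 time(s); maxRetries=3",
            "Retrying connect to server: server-name/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3");
        // Each endpoint is listed once, however many retries were logged for it.
        for (String endpoint : endpoints(lines)) {
            System.out.println(endpoint);
        }
    }
}
```

Running this over logs from several jobs would show whether the same ports recur or a new one appears each time, which is the distinction question (2) is asking about.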

Jason.Chen · ‎06-13-2015

Sean, Thanks so much. Yes, we would like to give it a try! Since I am travelling, I will pass it to my co-workers (Ying Lu or Jinsu Oh) to try out. Thanks a lot. Jason

Jason.Chen · ‎06-12-2015

Oh, 1190 is the number of items (the Y job). The 7.3 million users seem to be in the input to the X job. Yes, we are tracing the input size to see what happened.

Meanwhile, some side questions:

(1) Comparing a single VM and Hadoop, do you use the same hash function to hash IDs (user IDs and item IDs) to long IDs?

(2) Before ConvergenceSampleFn is called, is there any other pre-processing that could cut the IDs down? I traced some of the code but could not identify any.

Thanks

Jason.Chen · ‎06-12-2015

Sean, Can you explain a little where I can find such info? I checked the status of one particular job (a Y job named "....0-3-Y-RowStep...") in the Hadoop UI. This is a job that uses 30 reducers and failed to sample. In the "Map-Reduce Framework" counter information I saw:

(1) Combine input records: all zeros in our case
(2) Combine output records: all zeros in our case
(3) Map input records: 1190
(4) Map output records: 1190
(5) Reduce input records: 1190
(6) Reduce output records: 0

(A) Where else should I check?

(B) I noticed "Reduce output records=0", which does not look normal. However, I also checked the job that uses 10 reducers and samples fine, and it also shows "Reduce output records=0". Any thoughts?

Thanks.

Jason.Chen · ‎06-12-2015

Sean, Yes, we tried that. We took the long IDs of the 7.5 million users (yes, the long ID is the one that Oryx generates by hashing) and about 2021 of them are 0 mod 3673, so that looks right. It's odd that they are not passing in Oryx. We have about 1200 items, and taking the long IDs mod 3673 gives us nothing (no item long ID is 0 mod 3673).

Some follow-up questions:

(1) The sampling process is separate for user IDs and item IDs, right?

(2) In my previous example, I used iterations #3 and #4. On second thought, I think the sampling should happen BEFORE iteration 1 starts, right? I noticed there are several "data pre-processing" steps (e.g., MergeIDMappingStep). I was thinking the sampling happens there (MergeIDMappingStep) and the same sampled IDs are then used in every iteration. So I am confused that the "hashcode log message" I provided appears in each reducer of each iteration. Can you explain a little?

Thanks.
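As a rough sanity check on these counts, one can estimate how many IDs should land on residue 0 for a given modulus. This is a back-of-the-envelope sketch only: it assumes the hashed long IDs are spread uniformly over the residues, and the class and method names are made up for illustration.

```java
// Hypothetical helper for a rough estimate; not part of Oryx.
final class ExpectedSampleCount {

    /** Expected number of IDs with id % modulus == 0, assuming uniformly spread residues. */
    static double expectedSampled(long idCount, int modulus) {
        return (double) idCount / modulus;
    }

    /** Probability that no ID at all lands on residue 0, under the same assumption. */
    static double probabilityNoneSampled(long idCount, int modulus) {
        return Math.pow(1.0 - 1.0 / modulus, idCount);
    }

    public static void main(String[] args) {
        int modulus = 3673;        // modulus mentioned in this post
        long users = 7_500_000L;   // "7.5 million users"
        long items = 1_200L;       // "about 1200 items"

        System.out.printf("expected sampled users: %.1f%n", expectedSampled(users, modulus));
        System.out.printf("expected sampled items: %.2f%n", expectedSampled(items, modulus));
        System.out.printf("P(no item sampled):     %.2f%n", probabilityNoneSampled(items, modulus));
    }
}
```

Under that uniformity assumption the expected user count is on the order of two thousand, in line with the roughly 2021 observed, while the expected item count is well below one, so finding no item ID at 0 mod 3673 would not be surprising.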

Jason.Chen · ‎06-12-2015

Hm... it's strange that no IDs passed. We have 7.6 million user IDs.

A question on this: "...Yes, sampling is per iteration and samples the same IDs each time...."

As an example, say there are 30 reducers. In iteration 3:

(1) Reducer #1 loops over all the user IDs (and item IDs) inside reducer #1.
(2) Reducer #2 loops over all the user IDs (and item IDs) inside reducer #2.

Then, after iteration 3, it saves the sampled IDs. The same sampled IDs are then used in iteration #4, and it compares the difference between the estimated values for these samples?

Jason.Chen · ‎06-11-2015

Sean,

(1) Here are some results:

INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 2111185186541130611 hashCode= 977794330
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174317317673160368 hashCode= 463078209
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174428972624599832 hashCode= 1617905253
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3444764202548713566 hashCode= 1628781813
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3653094543606133455 hashCode= 1773010709

(2) In the Hadoop log, I do not see any output from the following statement. Based on this, it seems no ID passes the "if (userIDString.hashCode() % convergenceSamplingModulus == 0)" check:

log.info(Integer.toString(Long.toString(itemID).hashCode()));

(3) Can you explain overall how the sampling works?

(a) Does it sample in each reducer of each iteration?

(b) When it samples, does it loop over all long user IDs and long item IDs and then apply the mod? I saw you use hashCode in the new code. Oryx 1.0.1 uses the long IDs for the mod.

(c) int modulus = RandomUtils.nextTwinPrime(4 * opts.getNumReducers() * opts.getNumReducers());
Why did you choose the modulus this way?

Thanks.
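The modulus formula and the check quoted above can be tried out in isolation. The sketch below is not the Oryx code: nextTwinPrime here is a simple stand-in that assumes RandomUtils.nextTwinPrime returns the larger member of the first twin-prime pair (p, p + 2) with p >= n, an assumption that is at least consistent with the modulus 3673 reported in this thread for 30 reducers. The class name and isSampled are illustrative, and the hash codes are the ones from the log lines above.

```java
// Stand-in sketch under stated assumptions; not the Oryx or RandomUtils implementation.
final class ConvergenceSamplingSketch {

    /** Trial-division primality test, adequate for small moduli like these. */
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        if (n % 2 == 0) {
            return n == 2;
        }
        for (int d = 3; (long) d * d <= n; d += 2) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    /** Larger member of the first twin-prime pair (p, p + 2) with p >= n (assumed behavior). */
    static int nextTwinPrime(int n) {
        int p = Math.max(n, 3);
        while (!(isPrime(p) && isPrime(p + 2))) {
            p++;
        }
        return p + 2;
    }

    /** The check quoted in the post: an ID is sampled when its hash is 0 modulo the modulus. */
    static boolean isSampled(int hashCode, int modulus) {
        return hashCode % modulus == 0;
    }

    public static void main(String[] args) {
        int numReducers = 30;
        int modulus = nextTwinPrime(4 * numReducers * numReducers);
        System.out.println("modulus for " + numReducers + " reducers: " + modulus);

        // Hash codes copied from the log lines in this post.
        int[] loggedHashes = {977794330, 463078209, 1617905253, 1628781813, 1773010709};
        for (int hash : loggedHashes) {
            System.out.println(hash + " % " + modulus + " = " + (hash % modulus)
                + " sampled=" + isSampled(hash, modulus));
        }
    }
}
```

Since the modulus grows with the square of the reducer count, lowering the number of reducers makes the modulus much smaller and lets far more IDs through the check.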

Jason.Chen · ‎06-10-2015

Sean, Thanks for the follow-up. Yes, I can try that. Can you insert the appropriate log.info calls into the code you want me to try, so that it logs the proper info for you to review?

Meanwhile, I did try reducing the number of reducers (from 30 to 10), and I noticed it did sample to calculate the convergence distance. I checked the code, and it looks like the number of reducers is used to generate the modulus. For example:

Avg absolute difference in estimate vs prior iteration over 2124 samples: 0.02002799961913492

Jason

Jason.Chen · ‎06-10-2015

Sean, Yes, I saw this message for each iteration, something like:

Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence

I cannot share the exact IDs, but here is the format:

User ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX (X is either a letter or a digit)
Item ID: XXXXX_xxxxxxxx (X is either an upper-case letter or a digit; x is either a lower-case letter or a digit)

Thanks. Jason

Last Visited	‎07-06-2015 01:40 AM

Member Since	‎07-18-2014 11:03 PM
Posts	74

Cloudera Community

Re: Run Oryx on a machine that is not part of the ...


Re: Oryx ALS running with Hadoop
