Member since
07-18-2014
74
Posts
0
Kudos Received
0
Solutions
06-28-2015
11:03 PM
The connection problem shows up in RowStep. One example of the detailed log is below (I slightly modified it to hide some sensitive server info, but the main messages are kept).

One odd thing: even though it reports that it cannot reach the server:port (say 10.190.36.114:40915), as shown below, the job still completes eventually. I am thinking maybe it completes through other nodes on a "standard" port? Still, failing to connect to the server is not a good sign, because it adds unnecessary running time.

/// Logs ////
Thu May 28 07:27:57 PDT 2015 INFO Running job "Oryx-/user/xyz/int/def-1-122-Y-RowStep: Avro(hdfs://server105:8020/u... ID=1 (1/1)"
Thu May 28 07:27:57 PDT 2015 INFO Job status available at: http://server105:8088/proxy/application_1432750221048_0525/
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 2 time(s); maxRetries=3
...
Thu May 28 07:34:15 PDT 2015 INFO Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
Thu May 28 07:34:16 PDT 2015 INFO Finished Oryx-/user/xyz/int/def-1-122-Y-RowStep
Thu May 28 07:34:16 PDT 2015 INFO Completed RowStep in 379s
06-28-2015
11:23 AM
Sean,

I have some follow-up questions regarding this topic ("Run Oryx on a machine that is not part of the cluster").

We started to test the case where the Oryx 1.0 computation/serving layers run on VMs that are in a different virtual LAN from the Hadoop cluster. There are firewall port issues for the communication between the two virtual LANs. Therefore, we opened all the ports used by Hadoop on the Hadoop cluster's virtual LAN, so that the Oryx VMs can talk to it. We got the list of ports used by Hadoop from both the Hadoop configuration files and some online Cloudera CDH port info.

After doing that, yes, Oryx is able to submit jobs to the Hadoop cluster to some extent. However, it still runs into some communication issues. For example, in the Oryx log I see something like this:

Retrying connect to server: server-name/10.190.36.113:40651. Already tried 0 time(s); maxRetries=3
Retrying connect to server: server-name/10.190.36.113:40651. Already tried 1 time(s); maxRetries=3
....
Retrying connect to server: server-name/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Retrying connect to server: server-name/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3
....

I dug into the code and I do not understand why this could happen. My questions:

(1) Is the communication between Oryx and Hadoop bidirectional or unidirectional? My understanding is that Oryx uses the Hadoop configuration files to find out where (server and port) it should submit the jobs. After Oryx submits a job, how does Oryx know the job is completed? Does Oryx check with Hadoop to get the status? Or does Hadoop communicate back to the Oryx VM regarding the status?

(2) Related to (1) and the log info I posted above: are "dynamic" ports used during the Oryx-Hadoop communication? From the log messages, I see ports 40651 and 40915. They do not seem to be standard Hadoop ports, and these port numbers even change dynamically. This is confusing.

Thanks.
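One way to narrow this down from the Oryx VM is to check whether a given host:port accepts TCP connections at all through the firewall. The following is only a minimal sketch using plain `java.net` sockets; the class name `PortProbeSketch` and the method `isReachable` are hypothetical names introduced here, not part of Oryx or Hadoop, and the host and port are supplied on the command line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Oryx or Hadoop.
class PortProbeSketch {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: all count as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: PortProbeSketch <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port + " reachable = " + isReachable(host, port, 2000));
    }
}
```

A probe like this only tells whether something is listening on the port at that moment. If the ports in the retry messages are short-lived, the probe has to be run while the job is still running.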
06-13-2015
07:02 AM
Sean,

Thanks so much. Yes, we would like to give it a try! Since I am travelling, I will pass this to my co-workers (Ying Lu or Jinsu Oh) to try it out.

Thanks a lot.
Jason
06-12-2015
12:09 PM
Oh, 1190 is the number of items (the Y job). The 7.3 million users seem to be in the input to the X job. Yes, we are tracing the input size to see what happened...

Meanwhile, some side questions:

(1) Comparing single VM and Hadoop, do you use the same hash function to hash IDs (user IDs and item IDs) to long IDs?

(2) Before ConvergenceSampleFn is called, is there any other pre-processing that could possibly cut the IDs down? I traced some of the code, but could not identify any.

Thanks
06-12-2015
09:44 AM
Sean,

Can you explain a little bit where I can find such info? I checked one particular job's status (a Y job named "....0-3-Y-RowStep...") in the Hadoop UI. This is a job that uses 30 reducers and failed to sample. In the "Map-Reduce Framework" counter information, I see:

(1) Combine input records: all zeros in our case
(2) Combine output records: all zeros in our case
(3) Map input records: 1190
(4) Map output records: 1190
(5) Reduce input records: 1190
(6) Reduce output records: 0

(A) Where else should I check?

(B) I noticed "Reduce output records=0", which does not look normal. However, I also checked the job that uses 10 reducers and samples fine. It also has "Reduce output records=0". Any thoughts?

Thanks.
06-12-2015
08:37 AM
Sean,

Yes, we tried that. We took the long IDs of the 7.5 million users (yes, the long ID is the one that Oryx generates by hashing) and about 2021 of them are 0 mod 3673. So that looks right. It's odd that none pass in Oryx. We have about 1200 items, and taking the long IDs mod 3673 gives us nothing (no item long ID is 0 mod 3673).

Some follow-up questions:

(1) The sampling process is separate for user IDs and item IDs. Right?

(2) In my previous example, I used iterations #3 and #4. On second thought, I think the sampling should happen BEFORE iteration 1 starts. Right? I notice there are several "data pre-processing" steps (e.g., MergeIDMappingStep). I was thinking the sampling happens there (MergeIDMappingStep) and then the same sample IDs are used across all iterations. So I am confused that the "hashcode log message" I provided appears in each reducer of each iteration. Can you explain a little bit?

Thanks.
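As a rough sanity check on those counts, here is a small sketch of the expected number of IDs that are 0 mod 3673. It assumes the hashed IDs are spread roughly uniformly over the residues, which is an assumption made here and not something verified against the Oryx code. The class and method names are hypothetical.

```java
// Hypothetical helper for back-of-the-envelope arithmetic; not Oryx code.
class ExpectedSampleCountSketch {

    /** Expected number of IDs with (id mod modulus == 0), assuming uniformly spread hashes. */
    static double expectedPassing(long idCount, int modulus) {
        return (double) idCount / modulus;
    }

    public static void main(String[] args) {
        int modulus = 3673;
        // Counts taken from the post above: about 7.5 million users, about 1200 items.
        System.out.printf("users: %.1f%n", expectedPassing(7_500_000L, modulus));
        System.out.printf("items: %.2f%n", expectedPassing(1_200L, modulus));
    }
}
```

Under that assumption, roughly 2042 users would be expected to pass, which is close to the 2021 observed, while fewer than one item would be expected to pass, so seeing no item at 0 mod 3673 is not surprising for only about 1200 items.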
06-12-2015
12:01 AM
Hm... it's strange that no IDs passed. We have 7.6 million user IDs...

A question on this: "...Yes, sampling is per iteration and samples the same IDs each time...."

As an example, say there are 30 reducers. In iteration 3:

(1) In iteration 3, reducer #1: it loops over all the user IDs (and item IDs) inside reducer #1.
(2) In iteration 3, reducer #2: it loops over all the user IDs (and item IDs) inside reducer #2.

Then, after iteration 3, it saves the sampled IDs. The same sample IDs are then used in iteration #4, and it compares the difference between the estimated values for these samples?
06-11-2015
11:10 PM
Sean,

(1) Here are some results:

INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 2111185186541130611 hashCode= 977794330
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174317317673160368 hashCode= 463078209
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3174428972624599832 hashCode= 1617905253
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3444764202548713566 hashCode= 1628781813
INFO [main] com.cloudera.oryx.als.computation.iterate.row.ConvergenceSampleFn: userIDString= 3653094543606133455 hashCode= 1773010709

(2) In the Hadoop log, I do not see any output from the following statement:

log.info(Integer.toString(Long.toString(itemID).hashCode()));

Based on this, it seems no ID passes the "if (userIDString.hashCode() % convergenceSamplingModulus == 0)" check.

(3) Can you explain overall how the sampling works?

(a) Does it sample in each reducer of each iteration?

(b) When it samples, does it loop over all long user IDs and long item IDs and then apply the mod? I saw that you use hashCode in the new code. Oryx 1.0.1 uses the long IDs for the mod.

(c) int modulus = RandomUtils.nextTwinPrime(4 * opts.getNumReducers() * opts.getNumReducers());
Why did you choose the modulus this way?

Thanks.
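To make the modulus arithmetic concrete, here is a minimal sketch of the two expressions quoted above. The `nextTwinPrime` below is a stand-in written for this example and only assumed to behave like `RandomUtils.nextTwinPrime` (returning the larger member of the first twin-prime pair whose smaller member is at least n). Under that assumption, 30 reducers give 3673, which matches the modulus reported in the job log. The class and the other method names are hypothetical.

```java
// Hypothetical sketch; not the Oryx implementation.
class ConvergenceSamplingSketch {

    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    /** Stand-in for RandomUtils.nextTwinPrime (assumed behavior, see the text above). */
    static int nextTwinPrime(int n) {
        int p = Math.max(n, 3);
        while (!(isPrime(p) && isPrime(p + 2))) {
            p++;
        }
        return p + 2; // larger member of the twin-prime pair
    }

    /** Mirrors: nextTwinPrime(4 * numReducers * numReducers). */
    static int samplingModulus(int numReducers) {
        return nextTwinPrime(4 * numReducers * numReducers);
    }

    /** Mirrors the quoted check: idString.hashCode() % modulus == 0. */
    static boolean isSampled(String idString, int modulus) {
        return idString.hashCode() % modulus == 0;
    }

    public static void main(String[] args) {
        System.out.println("30 reducers -> modulus " + samplingModulus(30));
        System.out.println("10 reducers -> modulus " + samplingModulus(10));
        // One of the logged ID strings from the post above, checked against the 30-reducer modulus.
        System.out.println(isSampled("2111185186541130611", samplingModulus(30)));
    }
}
```

With this stand-in, fewer reducers give a smaller modulus, so a larger share of IDs would satisfy the check.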
06-10-2015
10:30 PM
Sean,

Thanks for the follow-up. Yes, I can try that. Can you insert the appropriate log.info calls into the code you want me to try, so that it logs the proper info for you to review?

Meanwhile, I did try reducing the number of reducers (from 30 to 10), and I noticed that it did sample to calculate the convergence distance. I checked the code, and it looks like the number of reducers is used to generate the modulus. For example:

Avg absolute difference in estimate vs prior iteration over 2124 samples: 0.02002799961913492

Jason
06-10-2015
05:14 PM
Sean,

Yes, I saw this message for each iteration, something like:

Using convergence sampling modulus 3673 to sample about 7.412388E-6% of all user-item pairs for convergence

I cannot share the exact IDs, but I can share the format:

User ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX (X is either a letter or a digit)
Item ID: XXXXX_xxxxxxxx (X is either an upper-case letter or a digit; x is either a lower-case letter or a digit)

Thanks.
Jason
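The percentage in that log line can be reproduced with simple arithmetic. The sketch below assumes that a user-item pair is sampled only when the user and the item each pass a 1-in-modulus check independently, so the sampled fraction is 1 / modulus². That reading is an assumption inferred from the number itself, not confirmed from the Oryx source, and the class and method names are hypothetical.

```java
// Hypothetical arithmetic check; not Oryx code.
class SamplingFractionSketch {

    /** Percentage of user-item pairs sampled if each side passes with probability 1/modulus. */
    static double pairSamplePercent(int modulus) {
        return 100.0 / ((double) modulus * modulus);
    }

    public static void main(String[] args) {
        // Modulus taken from the log line quoted above.
        System.out.println(pairSamplePercent(3673));
    }
}
```

For modulus 3673 this comes out at about 7.412388E-6 percent, in line with the logged figure.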