Member since 07-18-2014 · 74 Posts · 0 Kudos Received · 0 Solutions
08-24-2015
12:00 AM
Actually, I don't know the exact reason; I was stuck on this problem for a few days, with firewalls on all machines disabled from the very start. I usually deploy Hadoop, Spark and so on by extracting source tarballs. Fortunately, an edge node seems to be a good way to access cluster resources.
07-02-2015
12:14 AM
It is just polling HDFS for new files, on the order of every ~5 minutes. No, that message comes from exactly this process of refreshing the model by looking for any new model. "No available generation" means no models have been built yet. There's a delay between the time new data arrives -- which could include a new user or item -- and when it is incorporated into a model. It could be a long time, depending on how long you take to build models.

When a new model arrives, you can't just drop all existing users, since the new model won't have any info about very new users or items. This is to help keep track of which users/items should be retained in memory even if they do not exist in the new model. The new model replaces the old one user-by-user and item-by-item rather than by loading an entire new model in one shot. Yes, you have a state with old and new data at once, but this is fine for recommendations; they're not incompatible. It's just the current and newer state of an estimate of the user/item vectors.
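A rough sketch of that user-by-user swap in plain Java (the class and method names here are illustrative, not the project's actual API):

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: swap in a new model's user vectors entry by entry,
    // retaining users the new model doesn't know about yet.
    class UserVectorStore {
      private final Map<String,float[]> userVectors = new ConcurrentHashMap<>();

      void absorb(Map<String,float[]> newModel, Set<String> recentUsers) {
        // Update or add every user present in the new model
        userVectors.putAll(newModel);
        // Drop users absent from the new model, unless they arrived
        // after it was built and should be kept in memory
        userVectors.keySet().removeIf(
            u -> !newModel.containsKey(u) && !recentUsers.contains(u));
      }
    }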
06-29-2015
08:01 AM
Cool. I just read your changes, and it seems they only impact the local computation (not the Hadoop computation). Correct? Yes, I know the Hadoop computation is already doing the right thing and doesn't need a fix.
06-22-2015
01:37 AM
Sure guys, let me know if it seems to work. Once this is resolved I am going to cut a 1.1.0 release.
06-02-2015
08:34 AM
Yes, the number of splits, and therefore Mapper tasks, is determined by Hadoop MapReduce and is not altered or overridden here. 11 is a default number of Reducer tasks, which you can change. (For various reasons a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots. This is determined by MapReduce and defaults to 1 per machine, but can be changed if you know the machine can handle more. This is all just Hadoop machinery, yeah, not specific to this app; a minimal sketch of setting the reducer count follows.
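For reference, a small sketch of setting the reducer count on a vanilla MapReduce job using the standard Hadoop 2 API (nothing here is specific to this app):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "example");
        // 11 reducers; a prime count tends to spread keys more evenly
        job.setNumReduceTasks(11);
        // ... set mapper/reducer classes and input/output paths as usual ...
      }
    }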
05-29-2015
12:50 AM
Yes, that's a good reason, if you have to scale past one machine. Previously I thought you meant you were running an entire Hadoop cluster on one machine, which is fine for a test but much slower and more complex than a simple non-Hadoop one-machine setup.

The mapper and reducer will need more memory if you see them running out of memory. If memory is very low but not exhausted, a Java process slows down with too much GC. Otherwise, more memory does not help. More nodes do not necessarily help either. You still face the overhead of task scheduling and data transfer, and the time taken to do non-distributed work. In fact, if you set up your workers so they don't live on the same nodes as data nodes, it will be a lot slower.

For your scale, which fits in one machine easily, 7 nodes is big overkill, and 60 is far too big to provide any advantage. You're measuring pure Hadoop overhead, which you can tune, but which does not reflect work done. The upshot is that you should be able to handle data sets hundreds or thousands of times larger this way in roughly the same amount of time. For small data sets, you see why there is no value in trying to use a large cluster; the data is just too tiny to split up.
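If the tasks really are running out of memory, the per-task heap is controlled by standard Hadoop properties. The names below are the Hadoop 2 ones (older versions use mapred.child.java.opts instead), and 2048m is just an illustrative size:

    <!-- mapred-site.xml -->
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx2048m</value>
    </property>
    <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx2048m</value>
    </property>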
05-22-2015
12:23 AM
1 Kudo
ALS: yes, fold-in just as before.
k-means: assign the point to a cluster and update its centroid (but don't reassign any other points; see the sketch below).
RDF: assign the point to a leaf and update the leaf's prediction (but don't change the rest of the tree).
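As a rough illustration of the k-means case (plain Java, not the project's actual code; the update is just the standard incremental mean):

    // Illustrative only: fold a new point into an existing k-means model by
    // assigning it to the nearest centroid and nudging only that centroid.
    static int foldIn(double[] point, double[][] centroids, long[] counts) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int k = 0; k < centroids.length; k++) {
        double d = 0.0;
        for (int i = 0; i < point.length; i++) {
          double diff = point[i] - centroids[k][i];
          d += diff * diff;
        }
        if (d < bestDist) {
          bestDist = d;
          best = k;
        }
      }
      // Standard incremental mean: move the winning centroid toward the
      // new point by 1/n, where n now includes the new point
      counts[best]++;
      for (int i = 0; i < point.length; i++) {
        centroids[best][i] += (point[i] - centroids[best][i]) / counts[best];
      }
      return best;
    }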
04-10-2015
08:54 AM
This is just Java's locking library; it's not specific to the project. This is a lock that supports many readers at one time, but only one writer at a time (and no readers while a writer holds the write lock). You have to acquire the write lock to mutate the shared state, and the read lock to read it -- but a read lock won't exclude other readers.
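A minimal, self-contained example of the pattern with java.util.concurrent.locks.ReentrantReadWriteLock (the class and field names are illustrative):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class SharedState {
      private final ReadWriteLock lock = new ReentrantReadWriteLock();
      private long value;

      long read() {
        lock.readLock().lock();   // many readers may hold this at once
        try {
          return value;
        } finally {
          lock.readLock().unlock();
        }
      }

      void write(long newValue) {
        lock.writeLock().lock();  // exclusive: no readers or other writers
        try {
          value = newValue;
        } finally {
          lock.writeLock().unlock();
        }
      }
    }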
02-28-2015
10:49 AM
Have you set it to start a generation based on the amount of input received? That could be triggering the new computation. That said, are you sure it only has part of the input? It's possible the zipped file sizes aren't that comparable.

Yes, you simply don't have enough memory allocated to your JVM. Your system memory doesn't matter if you haven't let the JVM use much of it. This is in local mode, right? You need to use -Xmx to give it more heap.

Yes, it will use different tmp directories for different jobs. That's normal.
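For example, something like the following, where your-app.jar is a stand-in for whatever you actually launch and 4g is just an illustrative size:

    java -Xmx4g -jar your-app.jar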
02-26-2015
10:31 PM
Yeah, it looks like a local issue on my side locating Guava. Thanks for your reply.