Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3456 | 01-26-2018 04:02 AM |
| | 7094 | 12-22-2017 09:18 AM |
| | 3539 | 12-05-2017 06:13 AM |
| | 3863 | 10-16-2017 07:55 AM |
| | 11239 | 10-04-2017 08:08 PM |
07-02-2015
12:14 AM
Yes, but the question is why. This is just a message from the driver program saying the master can't be found. The real question is what happened to the Application Master. If you can find it in YARN, can you see what happened to that container? It almost surely failed to start, but why?
07-02-2015
12:14 AM
It is just polling HDFS for new files, on the order of every ~5 minutes. No, that message comes from exactly this process of refreshing the model by looking for a new one; "No available generation" means no models have been built yet.

There's a delay between the time new data arrives -- which could include a new user or item -- and when that is incorporated into a model. It could be a long time, depending on how long you take to build models. When a new model arrives, you can't just drop all existing users, since the new model won't have any info about very new users or items. This mechanism keeps track of which users/items should be retained in memory even if they do not exist in the new model.

The new model replaces the old one user-by-user and item-by-item, rather than by loading an entire new model. Yes, you have a state with old and new data at once, but this is fine for recommendations; they're not incompatible. It's just the current and newer state of an estimate of the user/item vectors.
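To make the retention idea concrete, here is a minimal Python sketch (not Oryx's actual Java code; the function and variable names are hypothetical) of merging in a new model while keeping old vectors only for users seen since the new model was built:

```python
# Hypothetical sketch of incremental model replacement: prefer the new
# model's vectors, but retain old vectors for users the new model has
# not yet incorporated, provided they were recently seen.

def merge_models(current, new_model, recently_seen):
    """current / new_model: dict of user -> feature vector.
    recently_seen: users observed since the new model was built."""
    merged = {}
    # Take the new vector wherever the new model has one.
    for user, vector in new_model.items():
        merged[user] = vector
    # Keep old vectors for users absent from the new model, but only
    # if they were recently seen; otherwise let them drop.
    for user, vector in current.items():
        if user not in merged and user in recently_seen:
            merged[user] = vector
    return merged

current = {"u1": [0.1, 0.2], "u2": [0.3, 0.4], "u3": [0.5, 0.6]}
new_model = {"u1": [0.15, 0.25]}   # rebuilt before u2/u3 arrived
recently_seen = {"u2"}             # u2 is new; u3 is stale
print(merge_models(current, new_model, recently_seen))
# {'u1': [0.15, 0.25], 'u2': [0.3, 0.4]}
```

The mixed state this produces (new vectors alongside a few old ones) is exactly the "old and new data at once" situation described above, which is fine for serving recommendations.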
07-01-2015
09:34 AM
Just to check: you have this commit, right? https://github.com/cloudera/oryx/commit/4b5e557a36f3d666bab0befc21b79efdf1fcd52d

The symptom here is that the App Master for the MR job dies straight away and can't be contacted. The important thing is to know why. For example, when I looked at the AM app screen (i.e. http://[host]:8088/cluster/app/application_1435553713675_0018) I saw something like:

Application application_1435553713675_0018 failed 2 times due to AM Container for appattempt_1435553713675_0018_000002 exited with exitCode: -104
For more detailed output, check application tracking page: http://[host]:8088/proxy/application_1435553713675_0018/ Then, click on links to logs of each attempt.
Diagnostics: Container [pid=13840,containerID=container_1435553713675_0018_02_000001] is running beyond physical memory limits. Current usage: 421.5 MB of 384 MB physical memory used; 2.7 GB of 806.4 MB virtual memory used. Killing container.

Do you see anything like that that says why the AM stopped?
06-29-2015
07:55 AM
Got it, that's a bug. I fixed it and pushed to master: https://github.com/cloudera/oryx/issues/115
06-29-2015
12:51 AM
For the stand-alone version? There's no Hadoop there. I mean in the Oryx log, yes. My next question, then, is whether you're sure this config is being used in your stand-alone mode. You can see where it's applied in "ReadInputs".
06-29-2015
12:41 AM
It's pretty likely. It would not be in the logs but in the error shown on the attempt's (dead) container's info screen in the history server. At least, I saw exactly the same thing and this resolved it, and I can sort of see why this is now a problem in Java 7.
06-29-2015
12:18 AM
This is the problem; a fix is coming momentarily: https://github.com/cloudera/oryx/issues/114 I never saw a Snappy issue; I'm on CDH 5.4.2. Right now it seems to be running OK after the above.
06-29-2015
12:15 AM
No, it should work the same in both cases. You should see a message like "Pruning near-zero entries". Are you seeing that much? That would start to narrow it down.
06-28-2015
11:42 PM
Yes, if model.decay.zeroThreshold is positive, then anything whose absolute value is smaller is pruned. This can mean entire users are removed if none of their prefs survive. Do you set this, or decay.factor? By default it's all off and nothing decays, though.
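For illustration, here is a minimal Python sketch (not the actual Oryx implementation; names are hypothetical) of what a positive zero-threshold prune does, including dropping a user entirely when none of its prefs survive:

```python
# Hypothetical sketch of zero-threshold pruning: drop near-zero
# preference values, and drop a user entirely if nothing survives.

def prune(prefs, zero_threshold):
    """prefs: dict of user -> dict of item -> strength."""
    pruned = {}
    for user, items in prefs.items():
        kept = {item: v for item, v in items.items()
                if abs(v) >= zero_threshold}
        if kept:  # users with no surviving prefs are removed outright
            pruned[user] = kept
    return pruned

prefs = {"u1": {"i1": 0.5, "i2": 0.001}, "u2": {"i1": -0.0002}}
print(prune(prefs, 0.01))
# {'u1': {'i1': 0.5}}
```

With the default (threshold off), nothing is pruned, which matches the "by default nothing decays" behavior described above.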
06-28-2015
11:29 PM
I see the same thing now. I bet that if you click through to the failed container, you see an error like:

Diagnostics: Container [pid=13840,containerID=container_1435553713675_0018_02_000001] is running beyond physical memory limits. Current usage: 421.5 MB of 384 MB physical memory used; 2.7 GB of 806.4 MB virtual memory used. Killing container.

If so, then at least we have the cause. I see what is failing, but not yet why, as there's no good reason the AM would only be allowed 384MB. It's a YARN config thing somewhere.
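If a low AM memory cap does turn out to be the culprit, the usual knob for an MR2 job is the Application Master memory setting in mapred-site.xml. This is a generic sketch of where such a limit typically comes from, not something confirmed for this particular cluster:

```xml
<!-- mapred-site.xml: memory granted to the MapReduce Application
     Master. A value as low as 384 MB would have to come from a site
     override somewhere, since the stock default is larger. -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <!-- JVM heap for the AM; keep it comfortably below the container size -->
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx768m</value>
</property>
```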