Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3451 | 01-26-2018 04:02 AM |
|  | 7090 | 12-22-2017 09:18 AM |
|  | 3538 | 12-05-2017 06:13 AM |
|  | 3857 | 10-16-2017 07:55 AM |
|  | 11231 | 10-04-2017 08:08 PM |
05-29-2015
12:50 AM
Yes, that's a good reason, if you have to scale up past one machine. Previously I thought you meant you were running an entire Hadoop cluster on one machine, which is fine for a test but much slower and more complex than a simple non-Hadoop one-machine setup. The mapper and reducer will need more memory if you see them running out of memory. If memory is very low but not exhausted, a Java process slows down because it spends too much time in GC. Otherwise, more memory does not help. More nodes do not necessarily help either: you still face the overhead of task scheduling and data transfer, plus the time taken for non-distributed work. In fact, if your workers do not live on the same nodes as the data nodes, it will be a lot slower. For your scale, which easily fits on one machine, 7 nodes is big overkill, and 60 is far too many to provide any advantage. You're measuring pure Hadoop overhead, which you can tune, but which does not reflect work done. The upshot is that you should be able to handle data sets hundreds or thousands of times larger this way in roughly the same amount of time. With small data sets, you can see why there is no value in trying to use a large cluster; the data is just too tiny to split up.
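If the issue really is mappers or reducers running out of memory, here's a minimal sketch of how task memory is typically raised in plain Hadoop 2.x MapReduce (the property values are illustrative only, not recommendations for your particular job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container memory requested from YARN for each task, in MB (illustrative values).
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "4096");
        // Task JVM heap; keep it below the container size to leave room for off-heap overhead.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        Job job = Job.getInstance(conf, "memory-config-sketch");
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}
```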
05-28-2015
01:48 AM
You are computing locally rather than on Hadoop, right? I don't think there's an easy way to compute memory usage, as it will vary somewhat with your parallelism as well as data size. I believe it will require one matrix loaded into memory locally, and that will drive most of the memory usage; you already have an estimate of that. That may help, but I'd also just measure the heap size empirically to know for sure. You can easily watch the JVM's GC activity in real time with a tool like JProfiler, if you really want to see what's happening. There's no point in using Hadoop if you're just going to run on one machine. It will be an order of magnitude slower, as there are a bunch of pointless writes to disk and all the overhead of a full distributed file system and resource scheduler. Hadoop makes sense only if you have a large cluster already, or you need fault tolerance. It sounds like you should simply get a decent estimate of your heap size requirements, which don't sound that large. It sounds like it's well under 9GB? You can easily get a machine in the cloud with tens of GB of RAM. Just do that. Oryx 2 is a completely different architecture. There is no local mode; it's all Hadoop (and Spark). It has a lot of pluses and minuses as a result. I think it would be even worse if you're trying to run on one small machine; it really needs at least a small cluster.
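For measuring the heap empirically, a minimal sketch using the standard java.lang.management API (nothing Oryx-specific, just an illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();
        // "used" counts live objects plus garbage not yet collected; "max" is the -Xmx ceiling.
        System.out.printf("heap used: %d MB, committed: %d MB, max: %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```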
05-25-2015
11:02 AM
I don't think maintenance releases get released as such with CDH for any component, since the release cycle and customer demand for maintenance releases differ from upstream. Important fixes are backported, though, so you already have some of 1.3.1 and beyond in CDH's 1.3.x branch. The changes themselves aren't different; they come from upstream. Minor CDH releases rebase on upstream minor releases and so 'sync' at that point (i.e., CDH 5.5 should have the latest minor release, whether that's 1.4.x or 1.5.x).
05-25-2015
10:19 AM
If you just mean the heap has grown to 9GB, that is normal in the sense that it does not mean 9GB of memory is actually in use. If you have an 18GB heap, then a major GC has likely not happened yet, since there is no memory pressure. I would expect usage to drop significantly after a major GC. To test, you can force a GC on the running process with "jcmd <pid> GC.run" in Java 7+.
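To see the same effect from inside the JVM (a hypothetical snippet, not from this thread): the reported "used" heap includes garbage that hasn't been collected yet, and it drops after a full GC.

```java
import java.lang.management.ManagementFactory;

public class GcDropDemo {
    public static void main(String[] args) {
        // Allocate and immediately discard a lot of short-lived objects.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] junk = new byte[1024];
        }
        long before = usedHeapMB();
        System.gc(); // request a full GC; the JVM may ignore it, but usually honors it
        long after = usedHeapMB();
        System.out.println("used before GC: " + before + " MB, after: " + after + " MB");
    }

    private static long usedHeapMB() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() >> 20;
    }
}
```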
05-25-2015
12:07 AM
That's right, though actual usage will probably be a little more than this due to other JVM overhead and other, much smaller data structures. But yes, that's a good start at an estimate.
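As a purely hypothetical illustration (the actual dimensions aren't shown in this thread): a dense matrix of 1,000,000 rows by 50 features stored as doubles takes roughly 1,000,000 × 50 × 8 bytes ≈ 400 MB, and JVM object headers plus the smaller auxiliary structures add a bit on top of that.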
05-24-2015
12:54 PM
Yes, it uses Typesafe Config (https://github.com/typesafehub/config), so you should be able to set values on the command line too. Hm, maybe I should change that log line to also output the current max heap, if only to be more informative and help debug. I'm not sure why you are seeing that.
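For illustration (an assumption about how your setup loads config, not a documented Oryx feature): Typesafe Config's default ConfigFactory.load() layers JVM system properties over the config file, so a -D flag on the command line can override a key. The key name below is made up.

```java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ConfigOverrideSketch {
    public static void main(String[] args) {
        // ConfigFactory.load() layers system properties over application.conf,
        // so `java -Dmodel.features=50 ...` overrides the file's value.
        // "model.features" is a made-up key for illustration only.
        Config config = ConfigFactory.load();
        int features = config.getInt("model.features");
        System.out.println("model.features = " + features);
    }
}
```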
05-24-2015
04:30 AM
If you're not seeing a problem, you can ignore it. The thing I'd watch for is whether you are nearly out of memory and spending a lot of time in GC. If so, then more heap or these other settings might help. Are you sure the heap is just 18GB? I agree this doesn't quite make sense otherwise. The memory estimate is just that, but it shouldn't ever be more than the heap total.
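If you want a quick check of how much time is going to GC without attaching a profiler (an illustrative sketch, not something the framework prints itself), the JVM exposes per-collector counters:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeCheck {
    public static void main(String[] args) {
        // Cumulative GC counts and times since JVM start, per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```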
05-23-2015
10:39 AM
Yes it is just the current heap usage, which is probably not near the max you set. It is normal. What warning do you mean?
05-22-2015
12:16 PM
If it's set, it probably needs to be an hdfs: path, but I don't think this setting matters in recent CDH.
05-22-2015
10:17 AM
I don't think that is used anymore in recent CDH; this is not how the assembly is distributed. What problem are you having?