Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3451 | 01-26-2018 04:02 AM |
|  | 7090 | 12-22-2017 09:18 AM |
|  | 3538 | 12-05-2017 06:13 AM |
|  | 3857 | 10-16-2017 07:55 AM |
|  | 11231 | 10-04-2017 08:08 PM |
05-29-2015
12:50 AM
Yes, that's a good reason, if you have to scale up past one machine. Previously I thought you meant you were running an entire Hadoop cluster on one machine, which is fine for a test but much slower and more complex than a simple non-Hadoop one-machine setup. The mapper and reducer will need more memory if you see them running out of memory. If memory is very low but not exhausted, a Java process slows down because it spends too much time in GC. Otherwise, more memory does not help. More nodes do not necessarily help either: you still face the overhead of task scheduling and data transfer, plus the time taken for non-distributed work. In fact, if your workers do not live on the same nodes as the data nodes, it will be a lot slower. For your scale, which easily fits on one machine, 7 nodes is big overkill, and 60 is far too many to provide any advantage. You're measuring pure Hadoop overhead, which you can tune, but which does not reflect work done. The upshot is that you should be able to handle data sets hundreds or thousands of times larger this way in roughly the same amount of time. With small data sets, you can see why there is no value in trying to use a large cluster; the data is just too tiny to split up.
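If the issue really is mappers or reducers running out of memory, here's a minimal sketch of how task memory is typically raised in plain Hadoop 2.x MapReduce (the property values are illustrative only, not recommendations for your particular job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container memory requested from YARN for each task, in MB (illustrative values).
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "4096");
        // Task JVM heap; keep it below the container size to leave room for off-heap overhead.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        Job job = Job.getInstance(conf, "memory-config-sketch");
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}
```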
05-28-2015
01:48 AM
You are computing locally rather than on Hadoop, right? I don't think there's an easy way to compute memory usage, as it will vary somewhat with your parallelism as well as data size. I believe it will require one matrix loaded into memory locally, and that will drive most of the memory usage; you already have an estimate of that. That may help, but I'd also just measure the heap size empirically to know for sure. You can easily watch the JVM's GC activity in real time with a tool like JProfiler, if you really want to see what's happening. There's no point in using Hadoop if you're just going to run on one machine. It will be an order of magnitude slower, as there are a bunch of pointless writes to disk and all the overhead of a full distributed file system and resource scheduler. Hadoop makes sense only if you have a large cluster already, or you need fault tolerance. It sounds like you should simply get a decent estimate of your heap size requirements, which don't sound that large. It sounds like it's well under 9GB? You can easily get a machine in the cloud with tens of GB of RAM. Just do that. Oryx 2 is a completely different architecture. There is no local mode; it's all Hadoop (and Spark). It has a lot of pluses and minuses as a result. I think it would be even worse if you're trying to run on one small machine; it really needs at least a small cluster.
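For measuring the heap empirically, a minimal sketch using the standard java.lang.management API (nothing Oryx-specific, just an illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();
        // "used" counts live objects plus garbage not yet collected; "max" is the -Xmx ceiling.
        System.out.printf("heap used: %d MB, committed: %d MB, max: %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```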
05-25-2015
11:02 AM
I don't think maintenance releases get released as such with CDH for any component, since the release cycle and customer demand for maintenance releases differ from upstream. Important fixes are backported, though, so you already have some of 1.3.1 and beyond in CDH's 1.3.x branch. The changes themselves aren't different; they come from upstream. Minor CDH releases rebase on upstream minor releases and so 'sync' at that point (i.e., CDH 5.5 should have the latest minor release, whether that's 1.4.x or 1.5.x).
05-25-2015
10:19 AM
If you just mean the heap has grown to 9GB, that is normal in the sense that it does not mean 9GB of memory is actually in use. If you have an 18GB heap, then a major GC has likely not happened yet, since there is no memory pressure. I would expect usage to drop significantly after a major GC. To test, you can force a GC on the running process with "jcmd <pid> GC.run" in Java 7+.
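To see the same effect from inside the JVM (a hypothetical snippet, not from this thread): the reported "used" heap includes garbage that hasn't been collected yet, and it drops after a full GC.

```java
import java.lang.management.ManagementFactory;

public class GcDropDemo {
    public static void main(String[] args) {
        // Allocate and immediately discard a lot of short-lived objects.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] junk = new byte[1024];
        }
        long before = usedHeapMB();
        System.gc(); // request a full GC; the JVM may ignore it, but usually honors it
        long after = usedHeapMB();
        System.out.println("used before GC: " + before + " MB, after: " + after + " MB");
    }

    private static long usedHeapMB() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() >> 20;
    }
}
```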
05-25-2015
12:07 AM
That's right, though actual usage will probably be a little more than this due to other JVM overhead and other, much smaller data structures. But yes, that's a good start at an estimate.
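As a purely hypothetical illustration (the actual dimensions aren't shown in this thread): a dense matrix of 1,000,000 rows by 50 features stored as doubles takes roughly 1,000,000 × 50 × 8 bytes ≈ 400 MB, and JVM object headers plus the smaller auxiliary structures add a bit on top of that.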
05-24-2015
12:54 PM
Yes, it uses Typesafe Config (https://github.com/typesafehub/config), so you should be able to set values on the command line too. Hm, maybe I should change that log line to also output the current max heap, if only to be more informative and help debug. I'm not sure why you are seeing that.
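For illustration (an assumption about how your setup loads config, not a documented Oryx feature): Typesafe Config's default ConfigFactory.load() layers JVM system properties over the config file, so a -D flag on the command line can override a key. The key name below is made up.

```java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ConfigOverrideSketch {
    public static void main(String[] args) {
        // ConfigFactory.load() layers system properties over application.conf,
        // so `java -Dmodel.features=50 ...` overrides the file's value.
        // "model.features" is a made-up key for illustration only.
        Config config = ConfigFactory.load();
        int features = config.getInt("model.features");
        System.out.println("model.features = " + features);
    }
}
```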
05-24-2015
04:30 AM
If you're not seeing a problem, you can ignore it. The thing I'd watch for is whether you are nearly out of memory and spending a lot of time in GC. If so, then more heap or these other settings might help. Are you sure the heap is just 18GB? I agree this doesn't quite make sense otherwise. The memory estimate is just that, but it shouldn't ever be more than the heap total.
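If you want a quick check of how much time is going to GC without attaching a profiler (an illustrative sketch, not something the framework prints itself), the JVM exposes per-collector counters:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeCheck {
    public static void main(String[] args) {
        // Cumulative GC counts and times since JVM start, per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```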
05-23-2015
10:39 AM
Yes it is just the current heap usage, which is probably not near the max you set. It is normal. What warning do you mean?
05-22-2015
12:16 PM
If it's set, it probably needs to be an hdfs: path, but I don't think this setting matters in recent CDH.
05-22-2015
10:17 AM
I don't think that is used anymore in recent CDH; this is not how the assembly is distributed. What problem are you having?