
Oryx log info of ALS

Explorer

Sean,

 

We'd like to understand a little more about the Oryx log lines below (from an ALS computation).

In particular, what is the heap number? Does it indicate the memory used by the Oryx computation layer during model computation?

Sometimes the number is not close to the heap size we configured for Oryx, yet it triggers a warning. So we want to confirm what the heap number shown below represents.

 

Thanks.

Jason

 

 

Sat May 23 08:57:48 PDT 2015 INFO 5800000 X/tag rows computed (7876MB heap)
Sat May 23 08:57:50 PDT 2015 INFO 5900000 X/tag rows computed (10487MB heap)
Sat May 23 08:57:53 PDT 2015 INFO 6000000 X/tag rows computed (7108MB heap)
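
For what it's worth, our working assumption is that a figure like "7876MB heap" is computed inside the JVM itself, roughly as in the sketch below using the standard Runtime API. Please correct us if Oryx derives it differently.

// Sketch of how a "used heap" number like "7876MB heap" is commonly derived in a JVM.
// This is only our assumption about the logging, not Oryx's actual code.
public class HeapLogExample {
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    long usedBytes = rt.totalMemory() - rt.freeMemory(); // heap currently in use
    long maxBytes = rt.maxMemory();                      // the -Xmx ceiling
    System.out.printf("%dMB heap used of %dMB max%n",
        usedBytes / (1024 * 1024), maxBytes / (1024 * 1024));
  }
}

If it is computed this way, the number naturally fluctuates from one log line to the next as garbage collection runs, which would explain why it does not stay close to the configured heap size.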

 



Explorer

Sean,

 

Yes, we tried looking at GC activity and it helped us identify the memory usage.

 

We are also trying to investigate the memory used by the Oryx computation layer, and to use Hadoop for the model computation.

 

(1) How can we estimate more precisely how much memory is needed?

 

 

(2) We also ran Oryx on Hadoop, and it runs very slowly. The good thing about Hadoop is that it avoids the out-of-memory errors, but we do want to address the slow computation. So my question is: do you have any suggestions for tuning the Hadoop-related settings in the Oryx config (say, mapper-memory-mb or reducer-memory-mb)?

 

(3) We heard that Oryx 2.0 uses Spark and has a built-in train/validation process. Would that help address the issues I mentioned in (2)?

 

 

Thanks for your time.

 

Jason

 

 

Master Collaborator

You are computing locally rather than on Hadoop, right? I don't think there's an easy way to compute memory usage, as it will vary somewhat with your parallelism as well as your data size. I believe it requires one matrix loaded into memory locally, and that drives most of the memory usage, and you have an estimate of that. That may help, but I'd also just measure the heap size empirically to know for sure. You can easily watch the JVM's GC activity in real time with a tool like JProfiler if you really want to see what's happening.

There's no point in using Hadoop if you're just going to run on one machine. It will be an order of magnitude slower, since there is a bunch of pointless writing to disk plus all the overhead of a full distributed file system and resource scheduler. Hadoop makes sense only if you already have a large cluster, or you need fault tolerance. It sounds like you should simply get a decent estimate of your heap size requirements, which don't sound that large. It sounds like it's well under 9GB? You can easily get a machine in the cloud with tens of GB of RAM; just do that.

Oryx 2 is a completely different architecture. There is no local mode; it's all Hadoop (and Spark). It has a lot of pluses and minuses as a result. I think it would be even worse if you're trying to run on one small machine; it's really intended for at least a small cluster.
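
If it helps, here is a minimal sketch of measuring heap use and GC time from inside the JVM with the standard java.lang.management beans; the class and method names (HeapMonitor, logHeapAndGc) are just illustrative, not anything in Oryx. As a very rough upper bound for that in-memory matrix, rows x features x 8 bytes is a reasonable starting point, assuming double values (that storage format is an assumption on my part).

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Illustrative helper for watching heap and GC empirically; not part of Oryx.
public class HeapMonitor {

  public static void logHeapAndGc() {
    MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = memoryBean.getHeapMemoryUsage();
    // getMax() may be -1 if no explicit maximum is defined
    System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
        heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);

    long collections = 0;
    long gcTimeMs = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      collections += gc.getCollectionCount();
      gcTimeMs += gc.getCollectionTime();
    }
    // If GC time is a large fraction of wall-clock time, the heap is too small.
    System.out.printf("gc collections=%d total gc time=%dms%n", collections, gcTimeMs);
  }

  public static void main(String[] args) {
    logHeapAndGc();
  }
}

Calling logHeapAndGc() periodically while the computation runs will tell you whether you're near the limit or just churning in GC.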

Explorer

Sean,

 

We are experimenting with (a) a single Computation node, and (b) a single Computation node plus a Hadoop cluster.

We want to see the difference in running time between (a) and (b).

 

Questions:

(1) What do you mean by "There's no point in using Hadoop if you're just going to run on one machine"? Our data will grow quickly, and then we cannot just use one VM (and keep increasing its memory). We think Hadoop MapReduce can help us scale up as the data grows.

(2) Is tuning "mapper-memory-mb" and "reducer-memory-mb" potentially a way to speed up the process, since it allocates more memory?

 

 

Thanks.

 

Jason

 

Master Collaborator

Yes, that's a good reason, if you have to scale past one machine. Previously I thought you meant you were running an entire Hadoop cluster on one machine, which is fine for a test but much slower and more complex than a simple non-Hadoop one-machine setup.

The mapper and reducer will need more memory only if you see them running out of memory. If memory is very low but not exhausted, a Java process slows down with excessive GC; otherwise, more memory does not help. More nodes do not necessarily help either. You still face the overhead of task scheduling and data transfer, plus the time taken to do the non-distributed work. In fact, if your workers do not live on the same nodes as the data nodes, it will be a lot slower.

For your scale, which easily fits on one machine, 7 nodes is big overkill, and 60 is far too many to provide any advantage. You're measuring pure Hadoop overhead, which you can tune, but which does not reflect work done. The upshot is that you should be able to handle data sizes hundreds or thousands of times larger this way in roughly the same amount of time. For small data sets, you can see why there is no value in trying to use a large cluster; the data is just too small to be worth splitting up.
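
If the mappers or reducers really do run out of memory, the knobs that matter at the Hadoop level look roughly like the sketch below. The mapreduce.* properties are standard Hadoop 2 settings; how the Oryx config keys you mentioned (mapper-memory-mb, reducer-memory-mb) map onto them is an assumption on my part, so check the Oryx reference config before relying on it.

import org.apache.hadoop.conf.Configuration;

// Sketch of the standard Hadoop 2 memory settings that mapper/reducer memory knobs
// ultimately control. The values are examples only; tune them to the actual workload.
public class MapReduceMemoryConfig {
  public static Configuration buildConf() {
    Configuration conf = new Configuration();
    conf.set("mapreduce.map.memory.mb", "4096");        // YARN container size per mapper
    conf.set("mapreduce.map.java.opts", "-Xmx3276m");   // JVM heap inside that container (~80%)
    conf.set("mapreduce.reduce.memory.mb", "8192");     // container size per reducer
    conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");
    return conf;
  }
}

Note that raising these only helps if the tasks are actually memory-starved; otherwise, as above, it just reserves resources without speeding anything up.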