Support Questions

Jason.Chen · 05-29-2015

Sean,

Following up on some scenarios I posted before, but posting this in a separate thread.

I am using Oryx 1.0 with Hadoop (CDH 5.4.1). It ran slowly, so I tuned mapper-memory-mb and reducer-memory-mb.

That did not help.

Is it possible to tune the Oryx config to (1) set the number of map and reduce tasks appropriately and (2) use LZO compression for map output?

Thanks.

srowen · 05-30-2015

As I said, I don't think more memory helps unless you are memory-bound; otherwise it does not increase performance. In general, you should let Hadoop choose the number of mappers. It would be more helpful to know something about your data and problem in order to recommend where to look. It sounds like your data is so small that the run time is all Hadoop overhead, and 'tuning' doesn't help because it would not reflect how a large data set behaves.

Jason.Chen · 05-30-2015

OK.

Understood that 3-4 GB of data is too small to show the benefits of using Hadoop (due to the overhead).

We are collecting data and it grows fast.

We will see whether the Hadoop-based computation scales well with much larger data.

Thanks.

Jason.Chen · 06-02-2015

Sean,

Two more questions, after checking the Hadoop logs and the Oryx computation logs. We want to understand how the Oryx computation works with Hadoop.

(1) When it computes X or Y (with Hadoop), the Oryx logs show, for example, "number of splits:2" and "Total input paths to process : 11".

Are these numbers determined by Hadoop automatically, or by Oryx? I checked the Oryx code and could not find them.

(2) Does the Oryx code control how many reducers run on each node simultaneously?

For example, is "mapreduce.tasktracker.reduce.tasks.maximum" overridden?

srowen · 06-02-2015

Yes, the number of splits and therefore Mapper tasks is determined by Hadoop MapReduce and this is not altered or overridden.
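To make the split count less mysterious, here is a rough Python sketch of the arithmetic that Hadoop's plain FileInputFormat applies to each splittable input file: the split size is max(minSize, min(maxSize, blockSize)), and the last split may run up to 10% over. The function names, the 128 MB block size, and the file sizes are assumptions for illustration only, not values from this job. Zero-length and non-splittable (for example, gzipped) files are not modelled. A job that uses a combining input format can also pack several small files into one split, which would be one way to see fewer splits than input paths.

```python
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size

MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Number of splits for one splittable file of positive size (bytes)."""
    splits = 0
    remaining = file_size
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final (possibly slightly oversized) split.
    if remaining > 0:
        splits += 1
    return splits


def total_splits(file_sizes, block_size):
    """Sum of per-file split counts; each split becomes one Mapper task."""
    split_size = compute_split_size(block_size)
    return sum(count_splits(size, split_size) for size in file_sizes)


if __name__ == "__main__":
    # Hypothetical input: 11 files of 300 MB each with a 128 MB block size.
    sizes = [300 * MB] * 11
    print(total_splits(sizes, 128 * MB))
```

Raising the minimum split size (or lowering the maximum) is the usual lever if you really need to change the mapper count, but as noted above the defaults are normally the right choice.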

11 is the default number of Reducer tasks, which you can change. (For various reasons a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots. The number of slots is determined by MapReduce; it defaults to 1 per machine but can be raised if you know the machine can handle many more.
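As an illustration of one commonly cited reason for preferring a prime reducer count, the sketch below mimics the rule used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, on integer keys that happen to share a common factor. The key pattern, the machine counts, and the helper names are hypothetical and are not taken from Oryx. The last function estimates how many "waves" of reducers a job needs given the number of reducer slots.

```python
import math
from collections import Counter


def partition(key_hash, num_reducers):
    """Default hash partitioning: non-negative hash modulo reducer count."""
    return (key_hash & 0x7FFFFFFF) % num_reducers


def reducer_load(key_hashes, num_reducers):
    """How many keys each reducer receives, indexed by reducer number."""
    counts = Counter(partition(h, num_reducers) for h in key_hashes)
    return [counts.get(r, 0) for r in range(num_reducers)]


def reduce_waves(num_reducers, machines, slots_per_machine=1):
    """Sequential rounds needed when only machines * slots reducers run at once."""
    return math.ceil(num_reducers / (machines * slots_per_machine))


if __name__ == "__main__":
    # Hypothetical keys: 300 integer IDs that are all multiples of 4.
    keys = range(0, 1200, 4)
    print("12 reducers:", reducer_load(keys, 12))  # only a few reducers get work
    print("11 reducers:", reducer_load(keys, 11))  # load is spread almost evenly
    # Hypothetical cluster of 4 machines with 1 reducer slot each.
    print("waves:", reduce_waves(11, machines=4))
```

With 12 reducers, keys that are multiples of 4 can only land on reducers 0, 4, and 8, so most reducers sit idle; with 11 reducers the same keys spread across all of them. Real keys are rarely this regular, so treat this as intuition rather than a measurement.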

This is all just Hadoop machinery, yeah, not specific to this app.

Tuning Hadoop parameters with Oryx 1.0