Support Questions


Tuning Hadoop parameters with Oryx 1.0

Explorer

Sean,

 

Following up on some scenarios I posted before, but in a separate thread...

I am using Oryx 1.0 with Hadoop (CDH 5.4.1). It ran slowly, so I tuned mapper-memory-mb and reducer-memory-mb, but that didn't help.

Is it possible to tune the Oryx config to (1) set the number of map and reduce tasks appropriately, and (2) use LZO compression for map output?
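For context, by (2) I mean the Hadoop-level map-output compression settings — a sketch of what I would put in mapred-site.xml, assuming MR1-era property names and that the hadoop-lzo codec is installed on every node:

```xml
<!-- mapred-site.xml: compress intermediate map output with LZO.
     MR1-era property names; assumes com.hadoop.compression.lzo
     (hadoop-lzo / CDH GPL Extras) is deployed on all nodes. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```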

 

Thanks.

1 ACCEPTED SOLUTION

Master Collaborator

As I say, I don't think more memory helps unless you are memory-bound; by itself it does not increase performance. In general you should let Hadoop choose the number of mappers. It would be more helpful to know something about your data and problem in order to recommend where to look. It sounds like your data is so small that the runtime is almost all Hadoop overhead, and "tuning" doesn't help because it does not reflect how a large data set would behave.


4 REPLIES


Explorer

OK.

Understood that 3-4 GB of data counts as "small" here, in the sense that Hadoop's overhead dominates and we don't yet see its benefits.

We are collecting data, and it is growing fast, so we will see whether the Hadoop-based computation scales well on much larger data.

Thanks.

Explorer

Sean,

Two more questions, after checking the Hadoop logs and the Oryx computation logs. We want to understand how the Oryx computation works with Hadoop.

(1) When it computes X or Y (with Hadoop), the Oryx logs show, for example, "number of splits:2" and "Total input paths to process : 11". Is that number determined automatically by Hadoop, or is it determined by Oryx? I checked the Oryx code and cannot find where it is set.

(2) Does the Oryx code control how many reducers run on each node simultaneously? For example, is "mapreduce.tasktracker.reduce.tasks.maximum" overridden?


Master Collaborator

Yes, the number of splits, and therefore the number of Mapper tasks, is determined by Hadoop MapReduce; Oryx does not alter or override it.
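As a rough sketch of how Hadoop arrives at that split count (this mirrors the new-API FileInputFormat logic, not anything in the Oryx code): the split size is the block size clamped between a configured minimum and maximum, and the split count follows from the total input size.

```java
// Sketch of FileInputFormat-style split sizing (illustrative, not Oryx code).
public class SplitEstimate {

    // Split size = block size, clamped to [minSize, maxSize].
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block (CDH default)
        long minSize = 1L;                    // default minimum split size
        long maxSize = Long.MAX_VALUE;        // default maximum split size
        long totalInput = 200L * 1024 * 1024; // e.g. ~200 MB of input data

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long splits = (totalInput + splitSize - 1) / splitSize; // ceiling division
        System.out.println(splits); // prints 2 -> "number of splits:2"
    }
}
```

So with ~200 MB of input and 128 MB blocks you get two splits, hence two Mapper tasks, regardless of anything in the application's configuration.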

11 is a default number of Reducer tasks, which you can change. (For various reasons, a prime number is a good choice.) Yes, you will see as many run simultaneously as you have reducer slots. That is also determined by MapReduce; it defaults to 1 per machine, but it can be raised if you know the machines can handle more.

This is all just Hadoop machinery, yeah; it's not specific to this app.
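For example, both knobs live on the Hadoop side, not in Oryx — a sketch using MR1-style property names in mapred-site.xml (illustrative values only):

```xml
<!-- mapred-site.xml: Hadoop-side reducer settings (MR1-era names, sketch) -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value> <!-- reducers per job; a prime number is a good choice -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- reduce slots per TaskTracker, i.e. simultaneous reducers per node -->
</property>
```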