
Flume's write job and Hive's read job affect each other's disk bandwidth.


Hi.

I am operating a Hadoop cluster, and Flume writes data to it continuously.

When Hadoop runs a heavy M/R job (via Hive on Tez), almost all of the datanodes' disk I/O is consumed by the M/R job.

At the same time, Flume cannot drain its channel data to the HDFS sink, and the channel fill rate keeps climbing.

---------------------------------------------------------

MR -- high read --> HDFS <-- write -- Flume

---------------------------------------------------------
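
For reference, the Flume side looks roughly like the following. This is a sketch only; the agent/channel/sink names and the HDFS path are placeholders, not my real config:

# file channel that fills up while the HDFS sink is starved of disk I/O
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /flume/checkpoint
agent1.channels.ch1.dataDirs = /flume/data
agent1.channels.ch1.capacity = 1000000

# HDFS sink whose writes compete with the M/R reads
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.channel = ch1
agent1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.k1.hdfs.batchSize = 1000
agent1.sinks.k1.hdfs.rollInterval = 300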

General solutions I can think of:

1. Add more (or faster) disks and spread reads/writes across them.

-> Easiest, but we will hit the same limit again when an M/R job x times larger runs, by the same rule. (See the config sketches after this list.)

2. Use storage policies: a hot/warm/cold architecture.

-> Make Flume write to HOT storage.

-> Large M/R jobs may read from WARM or COLD storage.

-> I have read the manual for this, but I have no hands-on experience with it, and it does not look easy. (Also sketched after this list.)

3. Something else?
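
For option 1, the "spread read/write" part can be configured in hdfs-site.xml. A minimal sketch, assuming the data directories below are placeholders for your real mount points:

<!-- hdfs-site.xml: one dfs.datanode.data.dir entry per physical disk -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk1/dn,/data/disk2/dn,/data/disk3/dn</value>
</property>
<!-- choose volumes by available space instead of pure round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>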

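For option 2, my reading of the docs (again, no hands-on experience) is that each dfs.datanode.data.dir entry must first be tagged with its storage type, e.g. [DISK]/data/disk1/dn,[ARCHIVE]/data/archive1/dn, and then policies are pinned per directory. A sketch; the HDFS paths are assumptions:

# keep the Flume landing directory on the fast (DISK) tier
hdfs storagepolicies -setStoragePolicy -path /flume/events -policy HOT

# push old, M/R-heavy data to the ARCHIVE tier
hdfs storagepolicies -setStoragePolicy -path /warehouse/old -policy COLD
# migrate already-written blocks to match the new policy
hdfs mover -p /warehouse/old
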
What would be a good approach?


Re: Flume's write job and Hive's read job affect each other's disk bandwidth.

@no jihun

A good and simple approach would be to use the Capacity Scheduler. Assuming the ingest and query workloads run as different users/groups, configure multiple capacity queues. Remember that this does not mean your cluster will be underutilized when only one user is running (unless you forcibly configure it that way).
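
A rough sketch of the queue setup in capacity-scheduler.xml; the queue names and percentages here are made up for illustration, so size them for your workload:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>ingest,batch</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.ingest.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>80</value>
</property>
<!-- let batch borrow idle capacity so the cluster stays fully used -->
<property>
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>100</value>
</property>

Apply it with yarn rmadmin -refreshQueues, then point the Hive-on-Tez jobs at the batch queue (set tez.queue.name=batch).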

A good tutorial to get you started is here:

http://hortonworks.com/hadoop-tutorial/configuring-yarn-capacity-scheduler-ambari/

You can fine-tune it as much as you want. This is a well-tested and very popular feature.