When processing large tables, MR performs better than Tez, but every forum says Tez is always better than MR. Please suggest.

New Contributor

We are comparing MR and Tez. Tez does better than MR on small and moderate data volumes, but MR beats Tez on large volumes. We have seen this multiple times on different test beds. Please suggest.

1 ACCEPTED SOLUTION

New Contributor

1) Please define the actual sizes and performance numbers that you encountered.

Ans.

  Data volume       Run                   Tez elapsed (secs)   MR elapsed (secs)
  ----------------  --------------------  -------------------  ------------------
  1900 records      1                     46.350               63.666
                    2                     40.341               55.633
                    3                     38.189               49.230
                    Average (as posted)   41.626               56.176

  91914 records     1                     32.049               52.920
                    2                     32.088               49.030
                    3                     32.156               51.760
                    Average (as posted)   32.097               51.236

  993168 records    1                     850.01               611.625
                    2                     865.230              691.751
                    3                     872.110              672.285
                    4                     868.995              567.466
                    Average (as posted)   861.781              635.781
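The averages quoted above can be re-derived from the individual runs. The following is a minimal sketch in Python, assuming the posted values pair up as three Tez/MR runs for each of the two smaller volumes and four runs for the largest one; the `runs` dictionary and the `summarize` helper are names introduced here for illustration only.

```python
from statistics import mean

# Elapsed times in seconds, copied from the post. How the values pair up
# into (Tez, MR) runs is an assumption about the original table layout.
runs = {
    "1900 records": {
        "tez": [46.350, 40.341, 38.189],
        "mr": [63.666, 55.633, 49.230],
    },
    "91914 records": {
        "tez": [32.049, 32.088, 32.156],
        "mr": [52.920, 49.030, 51.760],
    },
    "993168 records": {
        "tez": [850.01, 865.230, 872.110, 868.995],
        "mr": [611.625, 691.751, 672.285, 567.466],
    },
}


def summarize(runs):
    """Return {volume: (mean Tez secs, mean MR secs, Tez/MR ratio)}."""
    summary = {}
    for volume, times in runs.items():
        tez_avg = mean(times["tez"])
        mr_avg = mean(times["mr"])
        # A ratio below 1 means Tez was faster; above 1 means MR was faster.
        summary[volume] = (tez_avg, mr_avg, tez_avg / mr_avg)
    return summary


summary = summarize(runs)
for volume, (tez_avg, mr_avg, ratio) in summary.items():
    print(f"{volume}: Tez {tez_avg:.3f} s, MR {mr_avg:.3f} s, Tez/MR {ratio:.2f}")
```

Under that pairing, the recomputed means match the posted averages for the two smaller volumes and for MR on the largest one. The Tez mean for 993168 records comes out near 864.09 secs rather than the posted 861.781 secs, which suggests one of the posted Tez figures may have been mistyped.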

2) Clarify which test beds you are referring to and how you used them.

Ans. In the statistics table above:

Operation 1 creates a lateral view on a small data set.

Operation 2 joins 3 tables of intermediate data volume.

Operation 3 joins 4 tables of large data volume in an inner query, with an aggregation on top of that.

3) Clarify what type of test case you executed. It is important to clarify because some tests can be disk I/O intensive, others can be memory intensive.

Ans.

  • The above jobs ran in parallel, i.e. 10 jobs in parallel in Tez mode and 10 jobs in parallel in MR mode.
  • The above results are the output of multiple test iterations performed on different test beds.


4 REPLIES

Super Guru

@Vipul Choudhary

1) Please define actual size and performance numbers that you encountered.

2) Clarify which test beds you are referring to and how you used them.

3) Clarify what type of test case you executed. It is important to clarify because some tests can be disk I/O intensive, others can be memory intensive.

After clarifying all of the above, we may find that riding a bike is sometimes faster than driving a Ferrari. That may be because the bike is better suited to niche cases where there is little space for a car to get through (narrow roads, etc.). I would not generalize that easily. I am wary of anything stated as "is always better". There is always an exception. Anyhow, you can set the desired engine at the session level, whether you wish to use MR or Tez. Thus, for cases where MR performs better, use it. It is not as if you have to code it yourself when you execute a Hive query.
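To illustrate the session-level switch: Hive reads the engine from the `hive.execution.engine` property, so a query can be pinned to one engine by issuing a `SET` statement ahead of it. The following is a rough sketch in Python that only builds such a script as text; the `with_engine` helper and the `big_table` name are hypothetical, and the set of accepted values is an assumption that depends on your Hive version.

```python
# Engine names commonly accepted by hive.execution.engine; which ones are
# actually available depends on the Hive version and cluster setup.
VALID_ENGINES = {"mr", "tez", "spark"}


def with_engine(query: str, engine: str) -> str:
    """Prefix a HiveQL query with a session-level engine setting."""
    engine = engine.lower()
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    # Normalize the query so it ends with exactly one semicolon.
    body = query.rstrip().rstrip(";")
    return f"SET hive.execution.engine={engine};\n{body};"


# Hypothetical query: run the large aggregation on MR for this session only.
script = with_engine("SELECT COUNT(*) FROM big_table", "mr")
print(script)
```

The resulting text could be saved to a file and run with `beeline -f`, or passed inline with `beeline -e`. Because the setting is scoped to the session, other queries keep whatever default engine the cluster is configured with.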

Master Guru

Great analogy, but I only have a bike! 🙂 I'd like to be able to say "set my.transport.engine=ferrari;" and here it is, at my front door!

New Contributor

@Constantin Stanca Any thoughts on this?