Does Tez run slower than hive on larger dataset (~2.5 TB)?

New Contributor

We have started testing the Tez query engine. Initial results show a roughly 30% performance boost over Hive on smaller data sets (1-10 GB), but Hive starts to outperform Tez as the data size increases. For example, when we run a Hive query with Tez on about 2.3 TB of data, it performs about 20% worse than Hive alone. Details are in the post linked below.

On a cluster with 1.3 TB of RAM, I set the following properties:

set tez.task.resource.memory.mb=10000;
set tez.am.resource.memory.mb=59205;
set tez.am.launch.cmd-opts=-Xmx47364m;
set hive.tez.container.size=59205;
set hive.tez.java.opts=-Xmx47364m;
set tez.am.grouping.max-size=36700160000;

Is this normal, or am I missing a property or configuring one incorrectly? Also, I am currently using an older version of Tez. Could that be the issue too? I still have to bootstrap the latest version of Tez on EMR and test whether it does any better.

http://www.jwplayer.com/blog/hive-with-tez-on-emr/

Accepted Solutions

Re: Does Tez run slower than hive on larger dataset (~2.5 TB)?

I don't think this is enough information to give you an answer. Tez is normally faster even on large amounts of data, but your setup is pretty unusual: huge amounts of memory (Hive is normally CPU- or I/O-bound, so more nodes are generally a better idea) and very big tasks, which could lead to garbage collection issues (Tez reuses containers across tasks, which does not happen in MapReduce). Also, you did not change the sort memory, for example, so that could still be small.

The biggest question I have is about cluster and CPU utilization when you run both jobs. That is, is the cluster fully utilized (run top on a node) when you run the MapReduce job but not when you run the Tez job? Or is the job waiting on a specific node? Tez dynamically adjusts the number of reducers, so it is possible that it decided on fewer tasks. Running set hive.tez.exec.print.summary=true; can help you figure out which part of your query took longest.

The second question is query complexity. Tez allows the reuse of containers during execution, so complex queries tend to benefit more. It is always possible that a particular query will run better on MapReduce.
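
A minimal sketch of the comparison suggested above, for anyone who wants to try it in a single Hive session. It assumes a Hive build that recognizes hive.tez.exec.print.summary; older Hive/Tez combinations may not support it. The query placeholder is hypothetical and stands in for whatever query is being benchmarked.

```sql
-- Sketch only: run the same query on both engines and compare.
-- Assumption: this Hive version supports hive.tez.exec.print.summary.

-- 1) Baseline on MapReduce; note the wall-clock time and watch
--    CPU usage on a worker node (for example with top) while it runs.
set hive.execution.engine=mr;
-- <your query here>

-- 2) Same query on Tez, with the per-vertex execution summary enabled
--    so the slowest stage of the query is visible in the console output.
set hive.execution.engine=tez;
set hive.tez.exec.print.summary=true;
-- <your query here>
```

If the Tez run shows far fewer reducer tasks than the MapReduce run, or the nodes sit mostly idle, that points at parallelism rather than at Tez itself.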

4 REPLIES

Re: Does Tez run slower than hive on larger dataset (~2.5 TB)?

Rising Star

Do you know which version you're running? I ask because the following configuration parameter doesn't seem to be related to Hive on Tez:

set tez.task.resource.memory.mb=10000;

Check your hive.tez.* parameters.
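
One way to do that check, sketched under the assumption that you are in the Hive CLI or Beeline: issuing set with a property name and no value prints the value currently in effect. The property names below are the ones already used in the original post.

```sql
-- Print the current value of individual properties (no "=value" part).
set hive.execution.engine;
set hive.tez.container.size;
set hive.tez.java.opts;

-- Print every Hive and Hadoop configuration variable in the session;
-- the output is long, so it is usually filtered for "tez" afterwards.
set -v;
```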

Re: Does Tez run slower than hive on larger dataset (~2.5 TB)?

New Contributor

I found that property here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/ref-ffec.... I used the bootstrap script provided by Amazon. After logging on to the cluster, I realized I am using one of the oldest versions of Tez, something like 0.4. I still have to try the latest version, but I was wondering whether that alone would give that much of a performance boost.

Re: Does Tez run slower than hive on larger dataset (~2.5 TB)?

Mentor

@Rohit Garg, are you still having issues with this? Can you accept the best answer or provide your own solution?
