Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive queries use only mappers or only reducers

Solved Go to solution
Highlighted

Hive queries use only mappers or only reducers

Contributor

I'm looking for Hive query scenarios, where it uses only mappers or only reducers.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Hive queries use only mappers or only reducers

Super Guru

Hi @Ramya Jayathirtha

in hive if you do simple query like select * from table there will be no map reduce job is going to run as we are just dumping the data.

Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id  | foo.name  | foo.age  |
+---------+-----------+----------+--+
| 1       | a         | 10       |
| 2       | a         | 10       |
| 3       | b         | 10       |
| 4       | c         | 20       |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)

you can use explain by adding before with your query, it will displays how the query is going to execute by execution engine and display how many map reduce phases are going to be done for the query.

Hive# explain select * from foo;
+-------------------------------------------------------+--+
|                        Explain                        |
+-------------------------------------------------------+--+
| Plan not optimized by CBO.                            |
|                                                       |
| Stage-0                                               |
|    Fetch Operator                                     |
|       limit:-1                                        |
|       Select Operator [SEL_5652]                      |
|          outputColumnNames:["_col0","_col1","_col2"]  |
|          TableScan [TS_5651]                          |
|             alias:foo                                 |
|                                                       |
+-------------------------------------------------------+--+

When ever you do aggregations then the reducer phase will be executed along with map phase.

Hive# select count(*) from table group by name;
INFO  : Map 1: 0/1      Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 1/1      Reducer 2: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1
+------+--+
| _c0  |
+------+--+
| 2    |
| 1    |
| 1    |
+------+--+
3 rows selected (13.709 seconds)

if you add Explain in front of above query it will displays

Hive# explain select count(*) from foo group by name;
Reducer 2 <- Map 1 (SIMPLE_EDGE)                    
as you can see reducer phase along with map phase.

we can add another reducer phase to above query by adding order by clause to it

Hive# select count(*) cnt from foo group by name order by cnt;
INFO  : Map 1: 0/1      Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0/1  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1      Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 1/1
+------+--+
| cnt  |
+------+--+
| 1    |
| 1    |
| 2    |
+------+--+

you can see 2 reducer phases are done because after aggregating we are doing order by to the results

Map1 phase:- Loads the data from HDFS.

Reduer2:- Will does aggregation

Reducer 3:- after aggregation it will order the results to ascending order.

if you do explain on the above query

Hive# explain select count(*) cnt from foo group by name order by cnt;
 Vertex dependency in root stage     
 Reducer 2 <- Map 1 (SIMPLE_EDGE)    
 Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
4 REPLIES 4

Re: Hive queries use only mappers or only reducers

Super Guru

Hi @Ramya Jayathirtha

in hive if you do simple query like select * from table there will be no map reduce job is going to run as we are just dumping the data.

Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id  | foo.name  | foo.age  |
+---------+-----------+----------+--+
| 1       | a         | 10       |
| 2       | a         | 10       |
| 3       | b         | 10       |
| 4       | c         | 20       |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)

you can use explain by adding before with your query, it will displays how the query is going to execute by execution engine and display how many map reduce phases are going to be done for the query.

Hive# explain select * from foo;
+-------------------------------------------------------+--+
|                        Explain                        |
+-------------------------------------------------------+--+
| Plan not optimized by CBO.                            |
|                                                       |
| Stage-0                                               |
|    Fetch Operator                                     |
|       limit:-1                                        |
|       Select Operator [SEL_5652]                      |
|          outputColumnNames:["_col0","_col1","_col2"]  |
|          TableScan [TS_5651]                          |
|             alias:foo                                 |
|                                                       |
+-------------------------------------------------------+--+

When ever you do aggregations then the reducer phase will be executed along with map phase.

Hive# select count(*) from table group by name;
INFO  : Map 1: 0/1      Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2
INFO  : Map 1: 1/1      Reducer 2: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1
+------+--+
| _c0  |
+------+--+
| 2    |
| 1    |
| 1    |
+------+--+
3 rows selected (13.709 seconds)

if you add Explain in front of above query it will displays

Hive# explain select count(*) from foo group by name;
Reducer 2 <- Map 1 (SIMPLE_EDGE)                    
as you can see reducer phase along with map phase.

we can add another reducer phase to above query by adding order by clause to it

Hive# select count(*) cnt from foo group by name order by cnt;
INFO  : Map 1: 0/1      Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/2  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0/1  Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1      Reducer 3: 0/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 0(+1)/1
INFO  : Map 1: 1/1      Reducer 2: 1/1  Reducer 3: 1/1
+------+--+
| cnt  |
+------+--+
| 1    |
| 1    |
| 2    |
+------+--+

you can see 2 reducer phases are done because after aggregating we are doing order by to the results

Map1 phase:- Loads the data from HDFS.

Reduer2:- Will does aggregation

Reducer 3:- after aggregation it will order the results to ascending order.

if you do explain on the above query

Hive# explain select count(*) cnt from foo group by name order by cnt;
 Vertex dependency in root stage     
 Reducer 2 <- Map 1 (SIMPLE_EDGE)    
 Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

Re: Hive queries use only mappers or only reducers

Contributor

@Shu Thank you for the explanation.

I wanted to know Hive queries (Hive sql) where there is no reducer phase at all, only mapper phase. Is there such an example ?

Re: Hive queries use only mappers or only reducers

Super Guru

@Ramya Jayathirtha,

As i'm having id,name,age columns in foo table when ever we does

Hive# select name from foo; //in this case first map phase will loads the file and we only selected name column, we are not doing any filtering kind of things here so map phase checks name field and gives results.

MapSideJoins:-

Usually all joins will perform on reducer side as we can explicitly mention load tables to memory and performs joins, no reducer phase will be initialized.

Hive# select /*+MAPJOIN(..)*/... //this kind of joins will loads small table to memory and does the join on map phase only.

When ever we do insert values into table and loading the data should be used only map phase.

Hive# insert into foo values(1,'abc',200);
INFO  : Map 1: -/-
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Table default.foo stats: [numFiles=5, numRows=5, totalSize=38, rawDataSize=33]

Simple CTAS without Aggregations:-

When we does Create table as simple select then only mapper phase will be initialized.

if we does any aggregations then reducer phase will get initialized

Hive#create table foo1 stored as orc as select * from foo
INFO : Map 1: -/- INFO : Map 1: 0/1 INFO : Map 1: 0(+1)/1 INFO : Map 1: 1/1 INFO : Table default.foo1 stats: [numFiles=1, numRows=4, totalSize=XXX, rawDataSize=XXXX] No rows affected (10.247 seconds)
Hive#select * from foo1;
+----------+------------+-----------+--+
| foo1.id  | foo1.name  | foo1.age  |
+----------+------------+-----------+--+
| 1        | a          | 10        |
| 2        | a          | 11        |
| 2        | a          | 10        |
| 3        | b          | 10        |
| 4        | b          | 10        |
| 5        | c          | 10        |
+----------+------------+-----------+--+
6 rows selected (0.205 seconds)

2. if we does CTAS with where clause in it still it is just map phase all the filters in WHERE clause are going to be done by mapper phase it self.

Hive#create table foo as select * from foo1 where id='1';
INFO  : Map 1: -/-
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Table default.foo stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
No rows affected (9.984 seconds)
Hive#SELECT * FROM FOO;
+---------+-----------+----------+--+
| foo.id  | foo.name  | foo.age  |
+---------+-----------+----------+--+
| 1       | a         | 10       |
+---------+-----------+----------+--+
1 row selected (0.099 seconds)

Re: Hive queries use only mappers or only reducers

New Contributor

@Shu

How is number of Mappers/reducers decided for a given query will be decided in runtime ?

Is it dependet on how many number of Joins or group by or order by clauses that are used in the query ?

If yes, then please let me know how many mappers and reducers are launched for the below query.

select name, count(*) as cnt from test group by name order by name;