Created 10-14-2017 05:06 PM
I'm looking for Hive query scenarios, where it uses only mappers or only reducers.
Created 10-14-2017 11:24 PM
in hive if you do simple query like select * from table there will be no map reduce job is going to run as we are just dumping the data.
Hive# select * from foo; +---------+-----------+----------+--+ | foo.id | foo.name | foo.age | +---------+-----------+----------+--+ | 1 | a | 10 | | 2 | a | 10 | | 3 | b | 10 | | 4 | c | 20 | +---------+-----------+----------+--+ 4 rows selected (0.116 seconds)
you can use explain by adding before with your query, it will displays how the query is going to execute by execution engine and display how many map reduce phases are going to be done for the query.
Hive# explain select * from foo; +-------------------------------------------------------+--+ | Explain | +-------------------------------------------------------+--+ | Plan not optimized by CBO. | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Select Operator [SEL_5652] | | outputColumnNames:["_col0","_col1","_col2"] | | TableScan [TS_5651] | | alias:foo | | | +-------------------------------------------------------+--+
When ever you do aggregations then the reducer phase will be executed along with map phase.
Hive# select count(*) from table group by name; INFO : Map 1: 0/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 1/1 Reducer 2: 0/1 INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 INFO : Map 1: 1/1 Reducer 2: 1/1 +------+--+ | _c0 | +------+--+ | 2 | | 1 | | 1 | +------+--+ 3 rows selected (13.709 seconds)
if you add Explain in front of above query it will displays
Hive# explain select count(*) from foo group by name; Reducer 2 <- Map 1 (SIMPLE_EDGE)as you can see reducer phase along with map phase.
we can add another reducer phase to above query by adding order by clause to it
Hive# select count(*) cnt from foo group by name order by cnt; INFO : Map 1: 0/1 Reducer 2: 0/2 Reducer 3: 0/1 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 0/1 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 0(+1)/1 INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 1/1 +------+--+ | cnt | +------+--+ | 1 | | 1 | | 2 | +------+--+
you can see 2 reducer phases are done because after aggregating we are doing order by to the results
Map1 phase:- Loads the data from HDFS.
Reduer2:- Will does aggregation
Reducer 3:- after aggregation it will order the results to ascending order.
if you do explain on the above query
Hive# explain select count(*) cnt from foo group by name order by cnt;
Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Created 10-14-2017 11:24 PM
in hive if you do simple query like select * from table there will be no map reduce job is going to run as we are just dumping the data.
Hive# select * from foo; +---------+-----------+----------+--+ | foo.id | foo.name | foo.age | +---------+-----------+----------+--+ | 1 | a | 10 | | 2 | a | 10 | | 3 | b | 10 | | 4 | c | 20 | +---------+-----------+----------+--+ 4 rows selected (0.116 seconds)
you can use explain by adding before with your query, it will displays how the query is going to execute by execution engine and display how many map reduce phases are going to be done for the query.
Hive# explain select * from foo; +-------------------------------------------------------+--+ | Explain | +-------------------------------------------------------+--+ | Plan not optimized by CBO. | | | | Stage-0 | | Fetch Operator | | limit:-1 | | Select Operator [SEL_5652] | | outputColumnNames:["_col0","_col1","_col2"] | | TableScan [TS_5651] | | alias:foo | | | +-------------------------------------------------------+--+
When ever you do aggregations then the reducer phase will be executed along with map phase.
Hive# select count(*) from table group by name; INFO : Map 1: 0/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 INFO : Map 1: 1/1 Reducer 2: 0/1 INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 INFO : Map 1: 1/1 Reducer 2: 1/1 +------+--+ | _c0 | +------+--+ | 2 | | 1 | | 1 | +------+--+ 3 rows selected (13.709 seconds)
if you add Explain in front of above query it will displays
Hive# explain select count(*) from foo group by name; Reducer 2 <- Map 1 (SIMPLE_EDGE)as you can see reducer phase along with map phase.
we can add another reducer phase to above query by adding order by clause to it
Hive# select count(*) cnt from foo group by name order by cnt; INFO : Map 1: 0/1 Reducer 2: 0/2 Reducer 3: 0/1 INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 0/1 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 Reducer 3: 0/1 INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 0(+1)/1 INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 1/1 +------+--+ | cnt | +------+--+ | 1 | | 1 | | 2 | +------+--+
you can see 2 reducer phases are done because after aggregating we are doing order by to the results
Map1 phase:- Loads the data from HDFS.
Reduer2:- Will does aggregation
Reducer 3:- after aggregation it will order the results to ascending order.
if you do explain on the above query
Hive# explain select count(*) cnt from foo group by name order by cnt;
Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Created 10-16-2017 09:30 PM
@Shu Thank you for the explanation.
I wanted to know Hive queries (Hive sql) where there is no reducer phase at all, only mapper phase. Is there such an example ?
Created 10-16-2017 10:07 PM
As i'm having id,name,age columns in foo table when ever we does
Hive# select name from foo; //in this case first map phase will loads the file and we only selected name column, we are not doing any filtering kind of things here so map phase checks name field and gives results.
MapSideJoins:-
Usually all joins will perform on reducer side as we can explicitly mention load tables to memory and performs joins, no reducer phase will be initialized.
Hive# select /*+MAPJOIN(..)*/... //this kind of joins will loads small table to memory and does the join on map phase only.
When ever we do insert values into table and loading the data should be used only map phase.
Hive# insert into foo values(1,'abc',200); INFO : Map 1: -/- INFO : Map 1: 0/1 INFO : Map 1: 0(+1)/1 INFO : Map 1: 1/1 INFO : Table default.foo stats: [numFiles=5, numRows=5, totalSize=38, rawDataSize=33]
Simple CTAS without Aggregations:-
When we does Create table as simple select then only mapper phase will be initialized.
if we does any aggregations then reducer phase will get initialized
Hive#create table foo1 stored as orc as select * from foo
INFO : Map 1: -/- INFO : Map 1: 0/1 INFO : Map 1: 0(+1)/1 INFO : Map 1: 1/1 INFO : Table default.foo1 stats: [numFiles=1, numRows=4, totalSize=XXX, rawDataSize=XXXX] No rows affected (10.247 seconds)
Hive#select * from foo1; +----------+------------+-----------+--+ | foo1.id | foo1.name | foo1.age | +----------+------------+-----------+--+ | 1 | a | 10 | | 2 | a | 11 | | 2 | a | 10 | | 3 | b | 10 | | 4 | b | 10 | | 5 | c | 10 | +----------+------------+-----------+--+ 6 rows selected (0.205 seconds)
2. if we does CTAS with where clause in it still it is just map phase all the filters in WHERE clause are going to be done by mapper phase it self.
Hive#create table foo as select * from foo1 where id='1'; INFO : Map 1: -/- INFO : Map 1: 0/1 INFO : Map 1: 0(+1)/1 INFO : Map 1: 1/1 INFO : Table default.foo stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6] No rows affected (9.984 seconds)
Hive#SELECT * FROM FOO; +---------+-----------+----------+--+ | foo.id | foo.name | foo.age | +---------+-----------+----------+--+ | 1 | a | 10 | +---------+-----------+----------+--+ 1 row selected (0.099 seconds)
Created 03-06-2018 05:50 PM
How is number of Mappers/reducers decided for a given query will be decided in runtime ?
Is it dependet on how many number of Joins or group by or order by clauses that are used in the query ?
If yes, then please let me know how many mappers and reducers are launched for the below query.
select name, count(*) as cnt from test group by name order by name;