
# Question on Tez DAG tasks and Pig on Tez

Expert Contributor

I have a couple of basic questions about Tez DAG tasks, MapReduce jobs, and Pig running on Tez.

1. My understanding is that each vertex of a Tez DAG can run either mappers or reducers, but not both. Is this right?

2. Each vertex of a Tez DAG can have its own parallelism. Is this right? Say there are three vertices, one map vertex and two reducer vertices: can each reducer vertex run with a different parallelism? How do I control this?

3. When I look at the Pig explain plan on Tez, I see the vertices and the operations in them, but I don't see the parallelism of each vertex; I see it only when I dump the relation. How can I see the parallelism of each vertex in the Pig explain plan?

4. If I use the PARALLEL clause to control the number of reducers in Pig in Tez mode, does it control the parallelism of only the vertices running reducers? And does it affect all nodes running the reducers? Is there a way to control the parallelism of each vertex separately?

5. If a file has 4 splits, there would ideally be 4 mappers, right? In that case, would Tez have 4 vertices each running one mapper, or one vertex running 4 mappers?

6. How do I control the number of mappers (or the parallelism of the vertex running the mappers)?

7. While the Pig command is running, I see the total number of tasks, but how do I find the number of tasks in each vertex?

1 ACCEPTED SOLUTION
Guru

Quick preface:

A "vertex" is a more general form of the map and reduce stages in MapReduce. In MapReduce a job can have only two stages, so complex Pig scripts turn into multiple MapReduce jobs running one after another. In Tez, multiple stages (vertices) can be merged into the same job. So think of a vertex as a map or reduce stage.

Each vertex can have multiple tasks. In Pig, the same rules determine the number of tasks as in MapReduce (see the links below). However, it is sometimes difficult to configure two reducer stages independently, so normally you set a parameter such as the number of bytes per reducer (pig.exec.reducers.bytes.per.reducer), and Pig then tries to compute the number of Tez tasks from the output size of the previous stage/vertex. You can also hard-set the number (for example with default_parallel), but then all reducer vertices get the same number, which is not always what you want.
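To make the "bytes per reducer" idea concrete, here is a minimal Python sketch of the estimate described in the Pig performance documentation for MapReduce mode: the byte count divided by pig.exec.reducers.bytes.per.reducer, capped by pig.exec.reducers.max. The function name `estimate_reducers` is hypothetical, and the defaults (1 GB and 999) are the ones documented for Pig. Applying it to the output size of a previous vertex is an assumption based on the description above; Pig on Tez may adjust the number further at runtime.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate reducer parallelism from a byte count.

    Hypothetical helper mirroring the documented Pig formula:
    min(pig.exec.reducers.max, input size / pig.exec.reducers.bytes.per.reducer),
    rounded up and never below 1.
    """
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return min(max_reducers, max(1, wanted))


if __name__ == "__main__":
    # 2.5 GB feeding a reducer vertex at 1 GB per reducer
    print(estimate_reducers(2_500_000_000))  # 3
```

Because the estimate depends on the data feeding each vertex, two reducer vertices in the same DAG can end up with different parallelism without any per-vertex setting.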

1) Yes.

2) Yes.

How to control this: the same way as in MapReduce. See this link for reducers:

https://pig.apache.org/docs/r0.11.1/perf.html#reducer-estimation

and this link for mappers:

https://pig.apache.org/docs/r0.11.1/perf.html#combine-files

3) Not sure, honestly. Have you tried the Tez view?

4) Not sure what you mean by "node": one server, or one container? One container runs one task, and multiple containers (tasks) make up one vertex. To control their parallelism, see above.

5) One vertex running 4 mappers, unless the files are small, in which case they are combined (see the combine-files link above).

6) See above. (I might be misunderstanding the question, but it seems to be the same as above.)

7) Again, the Tez view in Ambari might help. In Hive there is a parameter, hive.tez.exec.print.summary, which shows you all of that. I have no idea whether something like that is available in Pig.
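As an illustration of how small splits can shrink the number of map tasks in that one vertex, here is a simplified Python sketch of split combination. It greedily packs consecutive splits until a size limit is reached, in the spirit of Pig's pig.maxCombinedSplitSize setting. The function `combine_splits` is hypothetical, and Pig's actual logic also takes things like data locality into account, so treat this as a rough model only.

```python
def combine_splits(split_sizes, max_combined_size):
    """Greedily pack consecutive splits into groups.

    Each returned group stands for one map task. A group is closed as soon
    as adding the next split would exceed max_combined_size. A single split
    larger than the limit still gets a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for size in split_sizes:
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    mb = 1024 * 1024
    # Four full-size splits: nothing to combine, so 4 map tasks in one vertex.
    print(len(combine_splits([128 * mb] * 4, 128 * mb)))  # 4
    # Four small files: combined into a single map task.
    print(len(combine_splits([10 * mb, 20 * mb, 30 * mb, 40 * mb], 128 * mb)))  # 1
```

Either way the tasks belong to the same vertex; only the task count inside that vertex changes.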

Expert Contributor

Hi Ben, thanks for taking the time to explain each of these questions.

For question 4, I meant to type "vertex" but wrote "node" instead. What I meant to ask was whether setting the number of reducers affects all the vertices that run only reducers.

Based on your explanation, I think it affects all the vertices running reducers, and it would not affect any vertex running mappers or a combination of mappers and combiners. Right?

By mapper and reducer, my understanding was that any class extending Mapper is a mapper and any class extending Reducer is a reducer. Just out of curiosity, when I look into the Pig source code, there are many operators like PORank, POFRJoin, etc., and these are the ones that are shown in the explain plan as the tasks of each vertex. So essentially, in a Tez DAG, Pig Latin gets converted to these operators, right? Are these operators run as part of mappers and reducers?

So, irrespective of whether the underlying task is a true mapper or reducer class or one of the Pig-on-Tez operators, is it correct to assume that the parallelism of root vertices, which read data from a file or table, is controlled by file splits or table partitions, and that the leaf vertices and the vertices in between all behave like reducers, with their parallelism controlled by reducer properties such as the number of reducers or bytes per reducer?

And if I write a UDF, is it possible to identify whether it is running inside a mapper class or a reducer class?

Guru

The thing is that Pig has an abstraction layer between the operators and the actual implementation. A Tez vertex per se does not need to be a mapper or a reducer; it is by definition more flexible. However, since Hive and Pig were written with the map/reduce model in mind, that model was kept for the compilation into Tez. After all, the underlying needs didn't change much: you still need mappers for data transforms and reducers for group-bys, joins, ...

In general I would look into the Tez view to find the details of the tasks.