Created 08-03-2016 12:37 PM
With Tez sessions, DAGs submitted within a session are handled by the same AppMaster. I am unable to understand how a new application (DAG) is mapped to the already running AppMaster. Who does this, and how? As per YARN, the ResourceManager is responsible for launching AppMasters. How is this functionality eclipsed by Tez?
Thanks in advance.
Created 08-03-2016 02:53 PM
Hi Shiva,
It's a Tez client API call you would need to make to find already existing ApplicationMasters belonging to your user in the cluster; you can then hook up with them. The main user at the moment is Hive, which uses this to reduce the startup cost of a query. Essentially, each JDBC connection of a Hive session (if enabled) maps to one ApplicationMaster in YARN. So when you run a query, Hive checks whether an ApplicationMaster already exists (using the Tez client API calls) and uses that AM, or creates a new one otherwise.
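The mapping described above amounts to a simple lookup: keep one AM handle per JDBC connection, create it lazily, and reuse it for subsequent queries. Below is a minimal, hypothetical Java sketch of that decision logic; `SessionPool` and `AppMasterHandle` are illustrative names, not Hive's actual classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: how a client such as Hive could map each JDBC
// connection to one ApplicationMaster and reuse it for later queries.
public class SessionPool {
    // Stand-in for a handle to a running Tez ApplicationMaster.
    static class AppMasterHandle {
        final int appId;
        AppMasterHandle(int appId) { this.appId = appId; }
    }

    private final Map<String, AppMasterHandle> amByConnection = new ConcurrentHashMap<>();
    private final AtomicInteger nextAppId = new AtomicInteger(1);

    // Returns the AM already associated with this connection, or
    // "launches" a new one (here: just mints a new application id).
    public AppMasterHandle amFor(String jdbcConnectionId) {
        return amByConnection.computeIfAbsent(
                jdbcConnectionId, id -> new AppMasterHandle(nextAppId.getAndIncrement()));
    }

    public static void main(String[] args) {
        SessionPool pool = new SessionPool();
        int first  = pool.amFor("conn-1").appId;  // creates an AM for conn-1
        int second = pool.amFor("conn-1").appId;  // reuses the same AM
        int other  = pool.amFor("conn-2").appId;  // different connection -> new AM
        System.out.println(first + " " + second + " " + other);  // prints "1 1 2"
    }
}
```

The point is only that the "who maps the DAG to the AM" question is answered client-side: the client remembers its AM and submits to it directly.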
Created 08-04-2016 04:58 AM
So the handshake between the client and the AppMaster in YARN (where the AppMaster is normally decommissioned once the job is done) is kept alive here in a Tez session. The client submits new DAGs directly to the AppMaster, and the ResourceManager sees it as the same application still running, so the DAGs run under the same application ID.
Correct me if I am wrong.
Created 08-04-2016 09:27 AM
Yeah, I have to say I didn't look into the Hive code, so I am not sure if you can actually "find" running Tez applications and attach to them. I think it's just the TezClient being kept open in HiveServer/Pig/whatever, which then submits more DAGs to the existing AM. But there might be ways for discovery. Basically, Tez doesn't take over much of what YARN does. This will be a bit different with LLAP, which is like a big YARN container running multiple Tez tasks; that one will have some workload management, scheduling, etc.
https://tez.apache.org/releases/0.7.1/tez-api-javadocs/org/apache/tez/client/TezClient.html
Created 08-04-2016 09:21 AM
As per YARN, the AppMaster is mere code, so I am unable to figure out how a new DAG can be submitted to an existing AppMaster that was written to handle some other DAG.
Created 08-04-2016 09:29 AM
Yeah, see above. I think you just have to have a client like Hive that opens a TezClient, creates an ApplicationMaster, and then submits more DAGs to it. Specifically, Hive by default has one Tez session per JDBC connection. So if you run multiple queries over the same JDBC connection, they use the same TezClient, the same Tez session, and, as long as the timeout is not reached, the same ApplicationMaster.
Yes, I think it sounds a bit more magical than it is: the reuse is just the session mode, where the client can send multiple DAGs to the same Tez AM. As I said, with LLAP you will have shared long-running processes that can be discovered, so it's a bit different.
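The timeout behavior can be sketched as well: while the session is within its idle timeout, each submitted DAG runs under the same application ID; once the timeout lapses, the next submission launches a fresh AM. A hypothetical Java sketch of that rule (the class and its logic are illustrative only, not Tez internals; the actual idle timeout in Tez is configured via a setting such as `tez.session.am.dag.submit.timeout.secs`):

```java
// Hypothetical sketch: DAG submissions reuse the same AM (same app id)
// until the session's idle timeout lapses, then a new AM is launched.
public class TezSessionSketch {
    private final long timeoutMillis;
    private long lastSubmitAt = 0;
    private int currentAppId = 0;   // 0 means "no AM launched yet"
    private int appCounter = 0;

    public TezSessionSketch(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Returns the application id the DAG would run under at time 'now'.
    public int submitDag(long now) {
        boolean needNewAm = currentAppId == 0 || now - lastSubmitAt > timeoutMillis;
        if (needNewAm) {
            currentAppId = ++appCounter;  // ResourceManager launches a new AM
        }
        lastSubmitAt = now;               // AM stays up; RM still sees one running app
        return currentAppId;
    }

    public static void main(String[] args) {
        TezSessionSketch session = new TezSessionSketch(10_000); // 10 s idle timeout
        System.out.println(session.submitDag(0));       // prints 1: first DAG launches an AM
        System.out.println(session.submitDag(5_000));   // prints 1: within timeout, same AM
        System.out.println(session.submitDag(30_000));  // prints 2: timeout lapsed, new AM
    }
}
```

This is why, from the ResourceManager's point of view, consecutive queries on one JDBC connection look like one long-running application.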
Created 08-04-2016 06:28 AM
Systems like Tez that are aware of the whole DAG of operations can do better global optimizations than systems like Hadoop MapReduce, which are unaware of the DAG to be executed.
While this is the theory, different systems implement it in different ways, and that is where the "advantages" and "disadvantages" come from. Computations expressed in Hadoop MapReduce boil down to multiple iterations of:
(i) read data from HDFS,
(ii) apply map and reduce,
(iii) write back to HDFS.
Each map-reduce round is completely independent of the others, and Hadoop has no global knowledge of which MR steps will come after each one. For many iterative algorithms this is inefficient, as the data between each map-reduce pair gets written to and read from the filesystem. Newer systems like Tez improve performance over Hadoop by considering the whole DAG of map-reduce steps and optimizing it globally (e.g., pipelining consecutive map steps into one, not writing intermediate data to HDFS). This avoids writing data back and forth after every reduce.
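The I/O saving can be made concrete by counting filesystem passes. A chain of k independent MapReduce jobs reads its input from HDFS and writes its output back once per job (2k passes), while a fully pipelined DAG that keeps intermediates off HDFS needs only the initial read and the final write. A toy Java illustration of that arithmetic (the idealized "fully pipelined" assumption is mine; real DAGs may still spill some intermediate data):

```java
// Toy arithmetic: HDFS passes for a chain of k stages executed as
// independent MapReduce jobs vs. one fully pipelined DAG.
public class HdfsPasses {
    // Each MR job reads its input from HDFS and writes its output back.
    static int mapReducePasses(int stages) { return 2 * stages; }

    // An idealized pipelined DAG keeps intermediates off HDFS: one read, one write.
    static int dagPasses(int stages) { return stages > 0 ? 2 : 0; }

    public static void main(String[] args) {
        for (int k = 1; k <= 4; k++) {
            System.out.println(k + " stages: MR=" + mapReducePasses(k)
                    + " HDFS passes, DAG=" + dagPasses(k));
        }
    }
}
```

For a 4-stage iterative job, that is 8 HDFS passes under plain MapReduce versus 2 under the DAG model, which is where the speedup for iterative algorithms comes from.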
If this is helpful, your acceptance of the answer is appreciated.
Created 08-04-2016 09:18 AM
Thank you @Shiv kumar