Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2208 | 08-12-2016 01:02 PM
 | 1283 | 08-08-2016 10:00 AM
 | 1247 | 08-03-2016 04:44 PM
 | 2641 | 08-03-2016 02:53 PM
 | 736 | 08-01-2016 02:38 PM
07-27-2016
12:58 PM
The main consequences are for running jobs, some of which may depend on ATS (too late for that), and for any investigation of the performance of old jobs (which are now gone). Apart from that, nothing I would know about. I would be interested to know who set the retention period to 8 years 🙂 That doesn't make any sense at all. You could also simply have changed that setting; it would then have cleaned up the old data soon enough on its own. Hope that works.
07-27-2016
10:59 AM
Damn, not fast enough; I was about to write this. You get the column counts, types, and some statistics out of it. You will have to invent the column names, though.
[root@sandbox ~]# hadoop fs -ls /apps/hive/warehouse/torc
Found 2 items
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0
-rwxrwxrwx 3 root hdfs 16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0_copy_1
[root@sandbox ~]# hive --orcfiledump /apps/hive/warehouse/torc/000000_0
WARNING: Use "yarn jar" to launch YARN applications.
Processing data file /apps/hive/warehouse/torc/000000_0 [length: 16653]
Structure for /apps/hive/warehouse/torc/000000_0
File Version: 0.12 with HIVE_8732
16/07/27 10:57:36 INFO orc.ReaderImpl: Reading ORC rows from /apps/hive/warehouse/torc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
16/07/27 10:57:36 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 823
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 823 hasNull: false
Column 1: count: 823 hasNull: false min: 00-0000 max: 53-7199 sum: 5761
Column 2: count: 823 hasNull: false min: Accountants and auditors max: Zoologists and wildlife biologists sum: 28550
Column 3: count: 823 hasNull: false min: 340 max: 134354250 sum: 403062800
Column 4: count: 819 hasNull: true min: 16700 max: 192780 sum: 39282210
07-25-2016
04:52 PM
You still will not have HIVE_HOME, because the scripts set it dynamically. You need to replace that placeholder with /usr/hdp/<your version, look it up on the Linux box>/hive.
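Something like this, for example (the version directory below is made up; check what actually sits under /usr/hdp on your node):
# look up the installed HDP version directory first (example output: 2.4.2.0-258)
ls /usr/hdp/
# then substitute the concrete path for the HIVE_HOME placeholder
export HIVE_HOME=/usr/hdp/2.4.2.0-258/hive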
07-25-2016
04:51 PM
So in Ambari go to Hosts, select the host you want, and press the big Add+ button.
07-25-2016
04:35 PM
Are you using HDP? Then you would install them through Ambari: Hosts -> Add Client.
07-25-2016
03:42 PM
You need to replace HIVE_HOME with the actual path: /usr/hdp/<version>/hive/. Also, on the node where you run it, you need a Hive client installed.
07-25-2016
01:14 PM
1 Kudo
There are literally a dozen different options here:
a) Did you enable SQL optimization in SPSS (requires the Modeler Server licence)? After that it can push tasks into the Hive data source. I am not sure whether Hive is a supported data source, but I would assume so; you can check the documentation. https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/sql_overview.htm
b) SPSS also supports a set of UDFs for in-database scoring, but that is not what you want.
c) Finally, there is the SPSS Analytic Server, which can essentially run most functions as a MapReduce job on the cluster. ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/1.0/English/IBM_SPSS_Analytic_Server_1_Users_Guide.pdf
Unfortunately, if you have neither the Modeler Server licence nor Analytic Server, there is not much you can do besides manually pushing pre-filters into the Hive database or optimizing your SPSS jobs further.
07-25-2016
01:02 PM
2 Kudos
@Kartik Vashishta I think you should read an HDFS book :-). Replication is not tied to specific discs. HDFS will put the 3 replicas of each block on different nodes; it does not have to choose a specific disc or node. The only rules are:
- All three replicas will be on different nodes.
- If you have rack topology enabled, the second and third copies will be on a different rack from the first copy.
It does not have to be a specific drive or node; HDFS will look for free space on any node that fits the requirements. The only issue I could imagine would be one huge node and some very small nodes that cannot match the size of the big node in total. (I have seen this with physical nodes mixed with VM nodes.)
07-25-2016
12:08 PM
@Kartik Vashishta Again, you don't understand HDFS. There is no limiting factor apart from the total disc capacity of the cluster. HDFS will put blocks (simply files on the local file system) on the discs of your cluster and fill them up, and it will make sure that you have 3 copies of each block. There is no limit but the total amount of space. Now, it's not very smart to have differently sized discs in your cluster, because it means not all spindles will be utilized equally: the small drives will fill up and then all write activity will happen on the bigger drives. So equal drive sizes are recommended, but not required; the other discs will not stay empty. It is also not a requirement to have the same number of drives in each node, but you need to configure each node with the correct number of drives using config groups, as Sagar said.
07-25-2016
10:44 AM
What Sagar says. It's not like RAID in the sense that whole discs are mirrored across nodes; blocks are put on different nodes and HDFS will try to fill up the available space. It's pretty flexible.
07-25-2016
10:42 AM
1 Kudo
If you use a distribution like HDP you cannot individually upgrade a component; the components are tested and supported together. So if you want a newer version of Hive, you would need to upgrade the whole distribution. HDP 2.5 will have a technical preview of Hive 2.0, for example. If you don't care about support, then good luck: you can try to install Hive manually, but the problem is that you will need to upgrade Tez manually as well. The hive.apache.org website has instructions on getting it to run, under "builds".
07-25-2016
09:37 AM
@Sunile Manjee Short answer: theoretically ORC ALWAYS makes sense, just less so than when you read only a subset of columns (then it's no question):
- It is stored in a protobuf-based format, so parsing the data is much faster than deserializing strings.
- It enables vectorization: if you aggregate on a column, ORC can read 10000 rows at a time and aggregate them all in one go, which is much better than parsing one row at a time.
- And it has features like predicate pushdown if you have WHERE conditions.
Once you read all columns there is no magic anymore; it will take some time. I would focus on the query analysis and try to identify any bottlenecks, but my guess would be that ORC is still your best bet.
07-22-2016
01:17 PM
In your other question it looked like there was simply a bug in the Phoenix part of the HBase installation. Sometimes that happens; a support ticket would log that. But I am 99% sure that normally Phoenix gets installed without any yum commands, in 2.3 and 2.4.
07-22-2016
01:14 PM
3 Kudos
Phoenix is installed by default in HDP 2.3, at least in my version; it's just a set of libraries in HBase that are always installed, unless I am completely mistaken. Or do you mean the Phoenix Query Server, which gets installed as a client on the nodes? (When you install, you can select it in the window where you also select DataNodes, NodeManagers, etc. If you forgot to do that, you can install the PQS later on a host using Ambari on the host page.) But the Phoenix libraries should be installed with HBase by default.
07-22-2016
01:06 PM
5 Kudos
1+2) It's simply the way Hadoop works. MapReduce guarantees that the input to the reducers is sorted. There are two reasons for this:
a) By definition, a reducer is an operation on a key and ALL values that belong to that key, regardless of which mapper they come from. A reducer could simply read the full input set and build a big hashmap in memory, but that would be ridiculously costly, so the other option is to sort the input dataset. It then simply reads all values for key1, and the moment it sees key2 it knows that there will be no more values for key1. So we have to sort the reducer input to enable the reducer function.
b) Sorting the keys gives a lot of nice benefits, like the ability to do a global sort more or less for free.
3) Reducers only merge-sort the input of the different mappers so that they have a globally sorted input list. This is low effort, since the input sets are already sorted.
4) "I have seen like nearly 3 times we are doing sorting and Sorting is too costly operation." No, you only sort once. The output of the mappers is sorted and the reducers merge-sort the inputs from the mappers; it is a single global sort operation. The mappers sort their output locally and the reducer merges these parts together. And as explained above, you HAVE to sort the reducer input for the reducer to work.
07-22-2016
12:14 PM
@Arunkumar Dhanakumar You can simply compress text files before you upload them. Common codecs include gzip, snappy, and LZO; HDFS does not care. All MapReduce/Hive/Pig jobs support these standard codecs and identify them by their file extension. If you use gzip, you just need to make sure that each file is not too big, since gzip is not splittable, i.e. each gzip file will result in one mapper. You can also compress the output of jobs: you could run a Pig job that reads the text files and writes them out again, and I think you simply need to add the .gz suffix to the output, for example. Again, you need to understand that each part file is then gzipped and will run in one mapper later. LZO and snappy, on the other hand, are splittable but do not provide as good a compression ratio. http://stackoverflow.com/questions/4968843/how-do-i-store-gzipped-files-using-pigstorage-in-apache-pig
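For the upload part, a minimal sketch (the file and target directory names are just placeholders):
# gzip is not splittable, so keep individual files modest; each .gz file becomes one mapper
gzip access_log_2016-07-01.txt
hadoop fs -put access_log_2016-07-01.txt.gz /data/raw/
The jobs reading that directory then pick the codec from the .gz extension on their own.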
07-22-2016
10:10 AM
4 Kudos
It is a Tez application. They stay around for a while to wait for new DAGs (execution graphs); otherwise you would need to create a new session for every query, which adds around 20s to your query time. It is configured here (normally a couple of minutes): tez.session.am.dag.submit.timeout.secs
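If you want to see what your cluster currently uses, something like this should work on an HDP node (assuming the usual /etc/tez/conf location for the Tez client config):
# value is in seconds; the idle session AM exits once this timeout passes without a new DAG
grep -A1 tez.session.am.dag.submit.timeout.secs /etc/tez/conf/tez-site.xml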
07-22-2016
09:56 AM
5 Kudos
1. On what basis does the ApplicationMaster decide that it needs more containers? That depends completely on the application master. For example, in Pig/Hive it computes the number of mappers based on input splits (blocks), so if you have a 1GB file with 10 blocks it will ask for 10 map containers. If you then specified 5 reducers, the application master will ask for 5 containers for the reducers. This calculation is different for each type of YARN workload. One example of "dynamic" requests is Slider, a framework you can "flex" up or down from the command line; but again, in the end the user tells Slider to request more. There is no magic scaling inherent to YARN; it depends on your application.
2. Will each mapper have a separate container? In classic MapReduce, one map = one container. (Tez, on the other hand, has container reuse, so a container it asked for for a "map" task can then be used for a reducer, for example.) And finally we will soon have LLAP, which can run multiple map/reduce tasks in the same container as threads, similar to a Spark executor. So all is possible.
3. Let's say one mapper launched in a container and completed 20% of the work; if it needs more resources to complete the remaining 80%, how are those resources allocated, and by whom? If distribution happens between containers, how does it happen? Again, it depends. MapReduce is stupid: it knows in advance how much work there is (number of blocks), asks for the number of mappers/containers it needs, and distributes the work between them. For the reducers, Hive/Tez for example can compute the number of reducers it needs based on the output size of the mappers, but once the reducer stage is started it does not change that anymore. So your question is not really correct.
Summary: you assume YARN would automatically scale containers on demand, but that is not what happens. What really happens is that the different workloads in YARN predict how many containers they need based on file sizes, map output, etc. and then ask for the correct number of containers for a stage. There is normally no scaling up/down within a single task. What is dynamic is YARN providing containers: if an application master asks for 1000 containers and there are only 200 slots, some occupied by other tasks, YARN can hand them over piece by piece. Some application masters, like MapReduce, are fine with that; others, like Spark, will not start processing until all the containers they requested are running at the same time. Again, it depends on the application master. There is nothing prohibiting an application master from scaling on demand if it wanted to, but that is not what happens in reality for most workloads like Tez/MapReduce/Spark. The only dynamic scaling I am aware of is in Pig/Hive between stages, where the application master predicts how many containers it needs for the reducer stage based on the size of the map output.
07-22-2016
09:45 AM
2 Kudos
First, regarding ORC: it is a column-store format, so it only reads the columns you need. So yes, fewer columns good, more columns bad. However, it's still better than a flat file, which reads all columns all the time and is not stored as efficiently (protobuf metadata, vectorized access, ...). But it's not magic. So the question, if you see big performance hits, is whether the join order is correct. Normally the CBO already does a decent job of figuring that out if you have statistics, as Constantin says, so that is the first step. The second is to analyze the explain plan and see if it makes sense. Worst case, you could break up the query into multiple pieces with temp tables / WITH statements to see if a different order results in better performance. I am also a fan of checking the execution with hive.tez.exec.print.summary to see if there is a stage that takes a long time and doesn't have enough reducers/mappers, i.e. a bottleneck.
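For that last point, a small example of how to run it (the query file name is made up):
# prints a per-vertex summary after the query so you can spot a stage that runs
# long or with too few mappers/reducers
hive --hiveconf hive.tez.exec.print.summary=true -f my_join_query.sql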
07-22-2016
09:15 AM
1 Kudo
Auxlib works; it's the only thing that works consistently for me. Are you using the Hive command line or beeline? Depending on that, you need to put the jars into the auxlib directory of the Hive server or of the Hive client. You also need to restart the Hive server.
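Roughly like this for the beeline/HiveServer2 case; the paths assume an HDP-style layout under /usr/hdp/current and the jar name is just a placeholder:
# put the UDF jar where HiveServer2 looks for auxiliary libraries (create the dir if needed)
mkdir -p /usr/hdp/current/hive-server2/auxlib
cp my-udf.jar /usr/hdp/current/hive-server2/auxlib/
# then restart HiveServer2 (for example through Ambari) so it picks up the jar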
07-21-2016
11:19 AM
@Josh Elser Sorry about that :-). Protobuf?
07-21-2016
11:13 AM
1 Kudo
Honestly, if I knew I would have mentioned them :-). I have set up a cluster before with a simple shell script ssh-all.sh: for i in server1 server2 server3; do ssh $i "$1"; done, and created users manually on a small cluster (we only had ~10 users, so it didn't seem worth it to set up LDAP). I never bothered about uids and never ran into problems, but we used standard stuff: Oozie, Hive, etc. Other people have told me that some components don't take this well; honestly, I am not sure which ones. I am sure that a NameNode HA setup with NFS does not work, because NFS depends on the same UID, but I have trouble thinking of another component that would need the same uids in a Hadoop environment. HDFS does not care about uids; it cares about usernames.
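For reference, a slightly cleaned-up version of that helper; the hostnames are placeholders and it simply runs whatever command you pass it on each node:
#!/bin/bash
# ssh-all.sh: run the given command on every node in the list
for host in server1 server2 server3; do
  ssh "$host" "$@"
done
# usage example: ./ssh-all.sh useradd analyst1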
07-21-2016
11:01 AM
3 Kudos
The YARN timeline store should clean up old values. The parameters are:
- cleanup cycle (when it deletes): yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms
- time to live (what to delete): yarn.timeline-service.ttl-ms
- and finally, enable the age-off: yarn.timeline-service.ttl-enable
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ref-e54bc3f2-f1bc-4bc6-b6cb-e6337589feb6.1.html
I would check those. If you want to clean things up, you can set them to low values and restart. Alternatively, you should be able to simply delete the database if you want to; it's just log information after all. Finally, if the parameters are correct and still don't work, you might want to open a support ticket.
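To see what is currently in effect, something like this should do on an HDP node (assuming the usual /etc/hadoop/conf location):
# check the ATS age-off settings; ttl-enable must be true for the other two to matter
grep -A1 -E 'yarn.timeline-service.(ttl-enable|ttl-ms|leveldb-timeline-store.ttl-interval-ms)' /etc/hadoop/conf/yarn-site.xml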
07-21-2016
09:36 AM
4 Kudos
HDFS: You need (with the default 3x replication) 30GB on the datanodes. On the namenode the space is negligible: you have 40 blocks x 3 = 120 block replicas, which is roughly 12 KB of RAM on the namenode (you need around 100 bytes of RAM for every block in namenode memory). You also need a bit of space on disc, but that's even less in the fsimage (files, but not blocks, are stored on disc; namenodes do need a bit more since they also store edits and multiple versions of the fsimage, but it is still very small).
HBase: a more complicated question. In HBase it depends on the way you store the data. Every field in your HBase table is stored in HFiles together with the key, the field name, the timestamp, and so on. So if you store everything in a single field per row, your storage is much smaller than if you had hundreds of 2-byte columns. On the other hand, you can also enable compression in HBase, which reduces space.
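Just to make the namenode arithmetic explicit (numbers taken from the answer above, roughly 100 bytes per block object):
# 40 blocks x 3 replicas = 120 block objects, at ~100 bytes of namenode heap each
echo $((40 * 3 * 100)) bytes   # -> 12000 bytes, i.e. roughly 12 KB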
07-21-2016
09:20 AM
3 Kudos
On big clusters people normally set up an LDAP server; IPA, for example, is free and simple. Look on GitHub for the security workshops of Ali Bajwa. Or, as said below, use an ssh script, Ansible, or a parallel shell tool to run commands on all nodes. Note that some more esoteric components of the stack require that usernames have the same uid on all nodes of the cluster. https://github.com/abajwa-hw
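A minimal sketch of the ssh-script route (hostnames, uid, and username are made up; run it as a user allowed to create accounts on the nodes). Pinning the uid keeps it identical everywhere for the components that care:
for host in node1 node2 node3; do
  ssh "$host" "useradd -u 1050 -m analyst1"
done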
07-20-2016
04:29 PM
1 Kudo
I heard there is some group caching in HDFS, but it should refresh after 5 minutes (hadoop.security.groups.cache.secs). Any chance to restart HDFS/YARN to make sure that's not the problem?
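If a restart is too disruptive, the cached user-to-group mappings can also be refreshed explicitly with the standard admin commands:
# force the NameNode and ResourceManager to re-read group memberships
hdfs dfsadmin -refreshUserToGroupsMappings
yarn rmadmin -refreshUserToGroupsMappings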
07-20-2016
02:49 PM
I think you should look at the Spark RDD programming introduction. What you get is an RDD of integers; you can then use Spark functions like map/foreach etc. to do things with it. So the question is what you actually want to do: why do you want a List? You can do rdd.collect to get it all in a big Array on your driver, but that is most likely not what you actually want. http://spark.apache.org/docs/latest/programming-guide.html I.e. clusterPoints.collect() will give you an array of points in your local driver; however, it downloads all results to the driver and does not run in parallel anymore. If that works with your data volumes, great, but normally you should use functions like map etc. so Spark can do the computations in parallel. Below is a scoring example that scores point by point, so you could do other things in this function as well; whatever you want to do with the information, essentially. http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/
val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
07-20-2016
02:35 PM
If you want to know, for your input points, which point belongs to which cluster, you need to use the predict method.
07-20-2016
02:34 PM
The class provides the method clusterCenters: public Vector[] clusterCenters(). Each Vector is a point, namely a cluster center. Or, as said, export it to PMML.
07-20-2016
01:00 PM
2 Kudos
The data is a Java class that contains the cluster information: cluster centers, statistics, and so on. If you want to work with that, you either need to use the Spark MLlib library to do extraction/scoring etc., OR you can export many of these models as PMML, an XML-based standard for clustering (and other) models that is understood by a lot of data mining tools. It can be exported for a lot of the models: kmeansModel.toPMML("/path/to/kmeans.xml") https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html Not all MLlib models support PMML, though.