Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive Queries run slowly

Re: Hive Queries run slowly

Super Collaborator

Hive is not Oracle. You should not expect the same processing capabilities.

 

Hive is designed to run long and heavy queries. Whereas it performed poorly for small queries like the one you try to optimize.

Also note that Hive run on top of Yarn and per design Yarn takes time to instanciate containers and the JVM inside these containers. This should correspond to your question "why it takes so much time to start the Query Job".

 

If you want to get a quick reply for a basic count(*) without any filter/condition you might want to read about Hive statistics.

 

 

 

Re: Hive Queries run slowly

Contributor

I think my query performs well.

From the Hadoop JobHistory Web Interface i see the following results for my query:

Submit Time : 2017.01.17 12:01:32 EST
Start Time : Tue Jan 17 12:01:35 EST 2017
Finish Time : 2017.01.17 12:01:36 EST

However the Hive Client renders data in seconds... what could be the bottleneck then ?

Highlighted

Re: Hive Queries run slowly

Champion
On the setting changes, stats, as stated will help with counts as that info is precalculates and stored in the metadata. The CBO and stats also help a lot with joins. It is possible that the OS cache is more to do with the improvement if this was a subsequent run with little activity.

You could look at Hive on Spark for better consistent performance.

Set hive.execution.engine = spark;

On the times, the big impact between job submission and start is the the scheduler. That is a deep topic. It is best if you read up on them and review your settings and ask any specific questions that come up, preferably in a new topic.

The other factor, not captured on the job stats, is the time it takes to return the results to the client. This will vary depending on the client and there isn't much to do about it. In general small result sets can be handle by the hive CLI. You can increase the client heap if needed. Otherwise use HS2 connections like beeline or HUE.