Support Questions

MasterOfPuppets · ‎01-14-2017

Hello,

I've a table containing about 4 millions rows and all queries like:

select count(*) from table1 or select col1 from table1 where col2=xxx

runs always in more than 15 seconds.

The table has been stored as ORC with index enabled.

My Env. is Cloudera 5.8.2 and the engine execution is mapreduce. For some reason, TEZ engine does not work.

Any idea how making queries running faster ?

Thanks.

mathieu.d · ‎01-17-2017

Hive is not Oracle. You should not expect the same processing capabilities.

Hive is designed to run long and heavy queries. Whereas it performed poorly for small queries like the one you try to optimize.

Also note that Hive run on top of Yarn and per design Yarn takes time to instanciate containers and the JVM inside these containers. This should correspond to your question "why it takes so much time to start the Query Job".

If you want to get a quick reply for a basic count(*) without any filter/condition you might want to read about Hive statistics.

MasterOfPuppets · ‎01-17-2017

I think my query performs well.

From the Hadoop JobHistory Web Interface i see the following results for my query:

Submit Time : 2017.01.17 12:01:32 EST
Start Time : Tue Jan 17 12:01:35 EST 2017
Finish Time : 2017.01.17 12:01:36 EST

However the Hive Client renders data in seconds... what could be the bottleneck then ?

mbigelow · ‎01-17-2017

On the setting changes, stats, as stated will help with counts as that info is precalculates and stored in the metadata. The CBO and stats also help a lot with joins. It is possible that the OS cache is more to do with the improvement if this was a subsequent run with little activity.

You could look at Hive on Spark for better consistent performance.

Set hive.execution.engine = spark;

On the times, the big impact between job submission and start is the the scheduler. That is a deep topic. It is best if you read up on them and review your settings and ask any specific questions that come up, preferably in a new topic.

The other factor, not captured on the job stats, is the time it takes to return the results to the client. This will vary depending on the client and there isn't much to do about it. In general small result sets can be handle by the hive CLI. You can increase the client heap if needed. Otherwise use HS2 connections like beeline or HUE.

Cloudera Community

Support Questions

Hive Queries run slowly