Created on 01-14-2017 05:55 PM - edited 09-16-2022 03:54 AM
Hello,
I've a table containing about 4 millions rows and all queries like:
select count(*) from table1 or select col1 from table1 where col2=xxx
runs always in more than 15 seconds.
The table has been stored as ORC with index enabled.
My Env. is Cloudera 5.8.2 and the engine execution is mapreduce. For some reason, TEZ engine does not work.
Any idea how making queries running faster ?
Thanks.
Created on 01-17-2017 08:27 AM - edited 01-17-2017 08:27 AM
Hive is not Oracle. You should not expect the same processing capabilities.
Hive is designed to run long and heavy queries. Whereas it performed poorly for small queries like the one you try to optimize.
Also note that Hive run on top of Yarn and per design Yarn takes time to instanciate containers and the JVM inside these containers. This should correspond to your question "why it takes so much time to start the Query Job".
If you want to get a quick reply for a basic count(*) without any filter/condition you might want to read about Hive statistics.
Created 01-17-2017 09:20 AM
I think my query performs well.
From the Hadoop JobHistory Web Interface i see the following results for my query:
Submit Time : 2017.01.17 12:01:32 EST
Start Time : Tue Jan 17 12:01:35 EST 2017
Finish Time : 2017.01.17 12:01:36 EST
However the Hive Client renders data in seconds... what could be the bottleneck then ?
Created 01-17-2017 11:50 AM