Support Questions

Find answers, ask questions, and share your expertise

slow query in hive

avatar
Master Collaborator

hi:

actually i have this query very slow, any suggestion to improve that?

select codinterno,sum(freq) as freq,codtxf 
from journey_oficina_v4 
where codnrbeenf = '3008' 
and (fechaoprcnf >= '2016-04-01 00:00:00.0' 
and fechaoprcnf < '2016-04-02 00:00:00.0' )
and codtxf = 'STR01OON' 
group by codinterno, codtxf 
order by freq desc

i just have a bucket for codnrbeenf and is not in the query so....

Thanks

1 ACCEPTED SOLUTION

avatar
Super Guru

as you are using Transactional Table you can not take advantage of sort by on fechaoprcnf column.Apart from partitioning try to create storage index on the table using tblproperties ("orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000")

-- orc stripe and index.stride value in this case are default try to tune these value and compare performace results.

View solution in original post

17 REPLIES 17

avatar
Super Guru

considering you are using orc table,if you are not using ACID table it will be good if you can modify the table DDL clustered by (codnrbeenf) sorted by (fechaoprcnf).

further to this you can create storage based index on orc table by specifying orc.create.index=true.

avatar
Master Collaborator

Hi: the table i have created is like this, I use bucket because i want to use ACID operations, but there is any other good solution to use ACID operations and improve the query??

CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina_v4(
  FECHAOPRCNF timestamp,
  codnrbeenf string,
  CODINTERNO string,
  CODTXF string,
  FREQ BIGINT
  )
CLUSTERED BY (codnrbeenf) INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as ORC
LOCATION '/RSI/tables/logs/journey_oficina_v4'
TBLPROPERTIES ("immutable"="false","transactional"="true")

avatar
Master Collaborator

Hi:

waht about partitions, or index??? and waht about the grouo by??

avatar
Super Guru

Hi @Roberto Sancho

Can you test the performance after disabling vectorization and running it on tez?

set hive.vectorized.execution.enabled=false;
set hive.vectorized.execution.reduce.enabled=false;
set hive.execution.engine=tez;

Also, please refer below doc for hive orc optimization.

https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance

avatar
Super Guru

as you are using Transactional Table you can not take advantage of sort by on fechaoprcnf column.Apart from partitioning try to create storage index on the table using tblproperties ("orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000")

-- orc stripe and index.stride value in this case are default try to tune these value and compare performace results.

avatar
Master Collaborator

Hi:

if i remove the transactional i cant modifi the table, so like this it will not improve nothing?

CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina_v5(
  FECHAOPRCNF timestamp,
  codnrbeenf string,
  CODINTERNO string,
  CODTXF string,
  FREQ BIGINT
  )
CLUSTERED BY (codnrbeenf) 
SORTED BY (FECHAOPRCNF)
INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as ORC
LOCATION '/RSI/tables/logs/journey_oficina_v5'
TBLPROPERTIES ("immutable"="false","transactional"="true", "orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000");

avatar
Master Guru

Forget indexes but partitions would help a lot . I normally expect Hive tables to be partitioned by date. So why don't you do that. You may need to add another column with an integer day column. Also ORC files are much faster than delimited files.

Finally clustering by code doesn't help you much since you don't use ORC. So there is no predicate pushdown at all and no performance enhancement. Even if you use ORCs you might not get benefits since you would have one file with all rows of your code and 24 files ( mappers) that would close immediately. So it might be better to cluster by something else and add a sorted by code keyword. However as said that only helps you with ORCs.

avatar
Master Collaborator

hi:

Finally i created it and iam doing ETL of 3 month of data, and i will see if the query have been improved

CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina(
  FECHAOPRCNF timestamp,
  codnrbeenf string,
  CODINTERNO string,
  CODTXF string,
  FREQ BIGINT
  )
CLUSTERED BY (codnrbeenf) INTO 60 BUCKETS
SORTED BY (FECHAOPRCNF)
stored as ORC
LOCATION '/RSI/tables/logs/journey_oficina'
TBLPROPERTIES ("immutable"="false","transactional"="true", "orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="67108864","orc.row.index.stride"="25000");

its ok???

avatar
Super Guru

with "transactional"="true" you are not able to compile this DDL statement, transactional table wont allow sorted column, are you able to successfully execute this statement?