Created 06-01-2016 09:34 AM
hi:
actually i have this query very slow, any suggestion to improve that?
select codinterno,sum(freq) as freq,codtxf from journey_oficina_v4 where codnrbeenf = '3008' and (fechaoprcnf >= '2016-04-01 00:00:00.0' and fechaoprcnf < '2016-04-02 00:00:00.0' ) and codtxf = 'STR01OON' group by codinterno, codtxf order by freq desc
i just have a bucket for codnrbeenf and is not in the query so....
Thanks
Created 06-01-2016 10:09 AM
as you are using Transactional Table you can not take advantage of sort by on fechaoprcnf column.Apart from partitioning try to create storage index on the table using tblproperties ("orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000")
-- orc stripe and index.stride value in this case are default try to tune these value and compare performace results.
Created 06-01-2016 09:44 AM
considering you are using orc table,if you are not using ACID table it will be good if you can modify the table DDL clustered by (codnrbeenf) sorted by (fechaoprcnf).
further to this you can create storage based index on orc table by specifying orc.create.index=true.
Created 06-01-2016 09:46 AM
Hi: the table i have created is like this, I use bucket because i want to use ACID operations, but there is any other good solution to use ACID operations and improve the query??
CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina_v4( FECHAOPRCNF timestamp, codnrbeenf string, CODINTERNO string, CODTXF string, FREQ BIGINT ) CLUSTERED BY (codnrbeenf) INTO 25 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' stored as ORC LOCATION '/RSI/tables/logs/journey_oficina_v4' TBLPROPERTIES ("immutable"="false","transactional"="true")
Created 06-01-2016 09:51 AM
Hi:
waht about partitions, or index??? and waht about the grouo by??
Created 06-01-2016 10:07 AM
Can you test the performance after disabling vectorization and running it on tez?
set hive.vectorized.execution.enabled=false; set hive.vectorized.execution.reduce.enabled=false; set hive.execution.engine=tez;
Also, please refer below doc for hive orc optimization.
https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
Created 06-01-2016 10:09 AM
as you are using Transactional Table you can not take advantage of sort by on fechaoprcnf column.Apart from partitioning try to create storage index on the table using tblproperties ("orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000")
-- orc stripe and index.stride value in this case are default try to tune these value and compare performace results.
Created 06-01-2016 10:19 AM
Hi:
if i remove the transactional i cant modifi the table, so like this it will not improve nothing?
CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina_v5( FECHAOPRCNF timestamp, codnrbeenf string, CODINTERNO string, CODTXF string, FREQ BIGINT ) CLUSTERED BY (codnrbeenf) SORTED BY (FECHAOPRCNF) INTO 25 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' stored as ORC LOCATION '/RSI/tables/logs/journey_oficina_v5' TBLPROPERTIES ("immutable"="false","transactional"="true", "orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000");
Created 06-01-2016 10:47 AM
Forget indexes but partitions would help a lot . I normally expect Hive tables to be partitioned by date. So why don't you do that. You may need to add another column with an integer day column. Also ORC files are much faster than delimited files.
Finally clustering by code doesn't help you much since you don't use ORC. So there is no predicate pushdown at all and no performance enhancement. Even if you use ORCs you might not get benefits since you would have one file with all rows of your code and 24 files ( mappers) that would close immediately. So it might be better to cluster by something else and add a sorted by code keyword. However as said that only helps you with ORCs.
Created 06-01-2016 10:59 AM
hi:
Finally i created it and iam doing ETL of 3 month of data, and i will see if the query have been improved
CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina( FECHAOPRCNF timestamp, codnrbeenf string, CODINTERNO string, CODTXF string, FREQ BIGINT ) CLUSTERED BY (codnrbeenf) INTO 60 BUCKETS SORTED BY (FECHAOPRCNF) stored as ORC LOCATION '/RSI/tables/logs/journey_oficina' TBLPROPERTIES ("immutable"="false","transactional"="true", "orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="67108864","orc.row.index.stride"="25000");
its ok???
Created 06-01-2016 11:06 AM
with "transactional"="true" you are not able to compile this DDL statement, transactional table wont allow sorted column, are you able to successfully execute this statement?