Support Questions

pacosoplas · ‎06-01-2016

hi:

actually i have this query very slow, any suggestion to improve that?

select codinterno,sum(freq) as freq,codtxf 
from journey_oficina_v4 
where codnrbeenf = '3008' 
and (fechaoprcnf >= '2016-04-01 00:00:00.0' 
and fechaoprcnf < '2016-04-02 00:00:00.0' )
and codtxf = 'STR01OON' 
group by codinterno, codtxf 
order by freq desc

i just have a bucket for codnrbeenf and is not in the query so....

Thanks

rajkumar_singh · ‎06-01-2016

as you are using Transactional Table you can not take advantage of sort by on fechaoprcnf column.Apart from partitioning try to create storage index on the table using tblproperties ("orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000")

-- orc stripe and index.stride value in this case are default try to tune these value and compare performace results.

View solution in original post

rajkumar_singh · ‎06-01-2016

considering you are using orc table,if you are not using ACID table it will be good if you can modify the table DDL clustered by (codnrbeenf) sorted by (fechaoprcnf).

further to this you can create storage based index on orc table by specifying orc.create.index=true.

pacosoplas · ‎06-01-2016

Hi: the table i have created is like this, I use bucket because i want to use ACID operations, but there is any other good solution to use ACID operations and improve the query??

CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina_v4(
  FECHAOPRCNF timestamp,
  codnrbeenf string,
  CODINTERNO string,
  CODTXF string,
  FREQ BIGINT
  )
CLUSTERED BY (codnrbeenf) INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as ORC
LOCATION '/RSI/tables/logs/journey_oficina_v4'
TBLPROPERTIES ("immutable"="false","transactional"="true")

pacosoplas · ‎06-01-2016

Hi:

waht about partitions, or index??? and waht about the grouo by??

jyadav · ‎06-01-2016

Hi @Roberto Sancho

Can you test the performance after disabling vectorization and running it on tez?

set hive.vectorized.execution.enabled=false;
set hive.vectorized.execution.reduce.enabled=false;
set hive.execution.engine=tez;

Also, please refer below doc for hive orc optimization.

https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance

rajkumar_singh · ‎06-01-2016

as you are using Transactional Table you can not take advantage of sort by on fechaoprcnf column.Apart from partitioning try to create storage index on the table using tblproperties ("orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000")

-- orc stripe and index.stride value in this case are default try to tune these value and compare performace results.

pacosoplas · ‎06-01-2016

Hi:

if i remove the transactional i cant modifi the table, so like this it will not improve nothing?

CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina_v5(
  FECHAOPRCNF timestamp,
  codnrbeenf string,
  CODINTERNO string,
  CODTXF string,
  FREQ BIGINT
  )
CLUSTERED BY (codnrbeenf) 
SORTED BY (FECHAOPRCNF)
INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as ORC
LOCATION '/RSI/tables/logs/journey_oficina_v5'
TBLPROPERTIES ("immutable"="false","transactional"="true", "orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="268435456", "orc.row.index.stride"="10000");

bleonhardi · ‎06-01-2016

Forget indexes but partitions would help a lot . I normally expect Hive tables to be partitioned by date. So why don't you do that. You may need to add another column with an integer day column. Also ORC files are much faster than delimited files.

Finally clustering by code doesn't help you much since you don't use ORC. So there is no predicate pushdown at all and no performance enhancement. Even if you use ORCs you might not get benefits since you would have one file with all rows of your code and 24 files ( mappers) that would close immediately. So it might be better to cluster by something else and add a sorted by code keyword. However as said that only helps you with ORCs.

pacosoplas · ‎06-01-2016

hi:

Finally i created it and iam doing ETL of 3 month of data, and i will see if the query have been improved

CREATE EXTERNAL TABLE IF NOT EXISTS journey_oficina(
  FECHAOPRCNF timestamp,
  codnrbeenf string,
  CODINTERNO string,
  CODTXF string,
  FREQ BIGINT
  )
CLUSTERED BY (codnrbeenf) INTO 60 BUCKETS
SORTED BY (FECHAOPRCNF)
stored as ORC
LOCATION '/RSI/tables/logs/journey_oficina'
TBLPROPERTIES ("immutable"="false","transactional"="true", "orc.create.index"="true","orc.compress"="ZLIB", "orc.stripe.size"="67108864","orc.row.index.stride"="25000");

its ok???

rajkumar_singh · ‎06-01-2016

with "transactional"="true" you are not able to compile this DDL statement, transactional table wont allow sorted column, are you able to successfully execute this statement?

Cloudera Community

Support Questions

slow query in hive

Slow query in Hive

Simple Select slow query in Hive

Investigating when the queries on LLAP are slow or...

Importing and Querying JSON data in Hive

Hive Query Recovery Tactics: Handling Failures thr...

Query Runs slow on hive when using NOT LIKE

ExecuteSQL query is very slow

Spark to support REGEX column specification for Hi...

Impala Partition Query running Slow

Tuning Large Hive Queries - Part 1