Created on 07-08-2019 06:13 AM - edited 09-16-2022 07:29 AM
Hello,
I need to improve speed for my select query currently I am testing how best I can get from Impala with given hardware.
I have created a parquet table with partitions and block size = 512mb.
Here is some more details -
[quickstart.cloudera:21000] > show table stats tbl_parq_512m; Query: show table stats tbl_parq_512m +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 4999950 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 | | 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 | | 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 | | 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 | | 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 | | 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 | | 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 | | 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 | | 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 | | 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 | | 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 | | Total | | | | 526989910 | 74 | 34.74GB | 0B | | | | | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+---------------------------------------------------------
When I run the below query , it takes 28 seconds to complete -
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10; Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10 Query submitted at: 2019-07-08 18:32:31 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=88422271f161227b:fe860ef500000000 +-----------+----------------+-------------------+-----------+ | source_ip | destination_ip | avg(volume) | avl | +-----------+----------------+-------------------+-----------+ | 10.0.0.1 | 20.0.0.1 | 4.499997599976 | 22500415 | | 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 | | 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 | | 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 | | 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 | | 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 | | 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 | | 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 | | 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 | | 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 | +-----------+----------------+-------------------+-----------+ Fetched 10 row(s) in 27.17s
I am trying to speed up the query so I tried going with gzip compression.
[quickstart.cloudera:21000] > show table stats tbl_gzip_512m; Query: show table stats tbl_gzip_512m +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 5000950 | 1 | 231.85MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 | | 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 | | 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 | | 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 5 | 2.22GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 | | 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 | | 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 | | 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 19 | 4.53GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 | | 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 | | 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 | | 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 | | 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 16 | 3.76GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 | | Total | | | | 526990910 | 80 | 23.84GB | 0B | | | | | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+---------------------------------------------------------
Size has dropped by 30% , it has reduced from 34gb to 24 gb.
I was expecting similar behaviour with query performance as well hoping it will reduce some more time.
Below is my query query-
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10; Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10 Query submitted at: 2019-07-08 18:39:09 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8d4bdddc207517ff:952aea9d00000000 +-----------+----------------+-------------------+-----------+ | source_ip | destination_ip | avg(volume) | avl | +-----------+----------------+-------------------+-----------+ | 10.0.0.1 | 20.0.0.1 | 4.50001219768244 | 22504983 | | 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 | | 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 | | 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 | | 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 | | 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 | | 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 | | 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 | | 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 | | 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 | +-----------+----------------+-------------------+-----------+ Fetched 10 row(s) in 27.67s
As you can see there isn't any improvement moreover it looks little slow despite decrease in data by 1/3rd.
So am I doing something wrong here when using gzip ?
Thanks
Created 07-15-2019 03:35 AM
Created 07-15-2019 03:35 AM