Created on 07-01-2019 07:02 AM - edited 09-16-2022 07:28 AM
Hello,
After many days of POC and some tuning I am getting some better results however the performance seems less then what I was expecting. I understand the need of better hardware as Impala is meant for bigdata however I also need to know its performance on some regular systems.
For this I created a parquet partitioned table and filled it with 500million data, I feel query is taking twice the time it should take.
I had set PARQUET_FILE_SIZE = 512MB
Details of my table -
tbl_parq_512m
+------------------------------+-------------------------------------------------------------------+----------------------+ | name | type | comment | +------------------------------+-------------------------------------------------------------------+----------------------+ | # col_name | data_type | comment | | | NULL | NULL | | protocol_number | int | NULL | | source_port | int | NULL | | destination_port | int | NULL | | packet | int | NULL | | volume | int | NULL | | duration | int | NULL | | pps | int | NULL | | bps | int | NULL | | bpp | int | NULL | | source_latitude | string | NULL | | source_longitude | string | NULL | | source_city | string | NULL | | source_country | string | NULL | | destination_latitude | string | NULL | | destination_longitude | string | NULL | | destination_city | string | NULL | | destination_country | string | NULL | | ingress_volume | int | NULL | | egress_volume | int | NULL | | ingress_packet | int | NULL | | egress_packet | int | NULL | | source_if_index | int | NULL | | destination_if_index | int | NULL | | source_host | string | NULL | | flow | int | NULL | | | NULL | NULL | | # Partition Information | NULL | NULL | | # col_name | data_type | comment | | | NULL | NULL | | source_ip | string | NULL | | destination_ip | string | NULL | | year | int | NULL | | event_date | string | NULL | | | NULL | NULL | | # Detailed Table Information | NULL | NULL | | Database: | default | NULL | | Owner: | root | NULL | | CreateTime: | Fri Jun 28 17:08:24 IST 2019 | NULL | | LastAccessTime: | UNKNOWN | NULL | | Protect Mode: | None | NULL | | Retention: | 0 | NULL | | Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m | NULL | | Table Type: | MANAGED_TABLE | NULL | | Table Parameters: | NULL | NULL | | | STATS_GENERATED_VIA_STATS_TASK | true | | | transient_lastDdlTime | 1561734077 | | | NULL | NULL | | # Storage Information | NULL | NULL | | SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL | | InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL | | OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL | | Compressed: | No | NULL | | Num Buckets: | 0 | NULL | | Bucket Columns: | [] | NULL | | Sort Columns: | [] | NULL | +------------------------------+-------------------------------------------------------------------+----------------------+
My data using table stats-
[quickstart.cloudera:21000] > show table stats tbl_parq_512m; Query: show table stats tbl_parq_512m +-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | -1 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 | | 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 | | 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 | | 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | -1 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 | | 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 | | 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 | | 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | -1 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 | | 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 | | 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 | | 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 | | 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | -1 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 | | Total | | | | -1 | 74 | 34.74GB | 0B | | | | | +-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
Block size details -
[root@quickstart home]# hdfs fsck -blocks /user/hive/warehouse/tbl_parq_512m; Connecting to namenode via http://quickstart.cloudera:50070/fsck?ugi=root&blocks=1&path=%2Fuser%2Fhive%2Fwarehouse%2Ftbl_parq_512m FSCK started by root (auth:SIMPLE) from /172.16.8.177 for path /user/hive/warehouse/tbl_parq_512m at Tue Jul 02 00:47:19 IST 2019 ..........................................................................Status: HEALTHY Total size: 37305601105 B Total dirs: 46 Total files: 74 Total symlinks: 0 Total blocks (validated): 74 (avg. block size 504129744 B) Minimally replicated blocks: 74 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 1 Average block replication: 1.0 Corrupt blocks: 0 Missing replicas: 0 (0.0 %) Number of data-nodes: 1 Number of racks: 1 FSCK ended at Tue Jul 02 00:47:19 IST 2019 in 15 milliseconds The filesystem under path '/user/hive/warehouse/tbl_parq_512m' is HEALTHY
My query and result-
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10; Query: select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10 Query submitted at: 2019-07-01 18:24:10 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=fd40c639c5e21365:3cbcfc1400000000 +-----------+----------------+-------------------+ | source_ip | destination_ip | avl | +-----------+----------------+-------------------+ | 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | | 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | | 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | | 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | | 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | | 10.0.0.1 | 20.0.0.1 | 4.499997599976 | | 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | | 61.0.0.1 | 70.0.0.1 | 4.500189103782076 | | 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | | 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | +-----------+----------------+-------------------+ Fetched 10 row(s) in 34.70s
As we can see its takes on average 34 seconds however I was expecting it to be quicker then this.
I am using the latest VM shipped and Cloudera Express.
I have increased my RAM to 10gb and harddisk to 250gb.
What more tunning can be done to improve performance, Do I need to tune CentOS ?
Thanks
Created on 07-03-2019 02:49 AM - edited 07-03-2019 02:54 AM
HI @punshi
Do you use the QuickStart VM for CDH 5.13?
Indeed, the VM is just for testing !! To unlock the real power of the impala (CDH), you should have a cluster so that you can benefit from the senergy of several nodes.
Anyway, you can improve your query time by:
1- Set PARQUET_FILE_SIZE = 256MB instead of 512MB.
2- Try to minimize the number of partitions (in fact, I think in your case, the year is sufficient).
NB: I think that in a cluster that has more than 10 nodes, this request will not exceed 2 to 4 seconds.
Good luck.
Created on 07-03-2019 02:49 AM - edited 07-03-2019 02:54 AM
HI @punshi
Do you use the QuickStart VM for CDH 5.13?
Indeed, the VM is just for testing !! To unlock the real power of the impala (CDH), you should have a cluster so that you can benefit from the senergy of several nodes.
Anyway, you can improve your query time by:
1- Set PARQUET_FILE_SIZE = 256MB instead of 512MB.
2- Try to minimize the number of partitions (in fact, I think in your case, the year is sufficient).
NB: I think that in a cluster that has more than 10 nodes, this request will not exceed 2 to 4 seconds.
Good luck.
Created 07-03-2019 07:59 AM
Thanks @AcharkiMed
I tried that there was no improvement however after enabling hyper threading I was able to reduce it to 25sec from 40. I tried my hands on HDFS cache however even after defining cache_pool size to 3gb only 1 gb data gets cached, Any idea ?
Query: show table stats tbl_parq_123 +-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+ | year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+ | 1990 | -1 | 2 | 338.45MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 | | 1993 | -1 | 6 | 1.32GB | 0B | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 | | 1994 | -1 | 6 | 1.32GB | 1010.95MB | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 | | 1995 | -1 | 14 | 3.24GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 | | 1996 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 | | 1997 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 | | 1998 | -1 | 27 | 6.60GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 | | 1999 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 | | 2000 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 | | 2001 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 | | 2002 | -1 | 23 | 5.48GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 | | Total | -1 | 148 | 34.79GB | 1010.95MB | | | | | +-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+ F
[root@quickstart ~]# hdfs cacheadmin -listPools Found 1 result. NAME OWNER GROUP MODE LIMIT MAXTTL three_gig_pool impala hdfs rwxr-xr-x 3000000000 never
Thanks