Member since
06-17-2019
23
Posts
0
Kudos Received
0
Solutions
07-08-2019
06:13 AM
Hello,
I need to improve speed for my select query currently I am testing how best I can get from Impala with given hardware.
I have created a parquet table with partitions and block size = 512mb.
Here is some more details -
[quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 4999950 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | 526989910 | 74 | 34.74GB | 0B | | | | |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+---------------------------------------------------------
When I run the below query , it takes 28 seconds to complete -
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-08 18:32:31 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=88422271f161227b:fe860ef500000000
+-----------+----------------+-------------------+-----------+
| source_ip | destination_ip | avg(volume) | avl |
+-----------+----------------+-------------------+-----------+
| 10.0.0.1 | 20.0.0.1 | 4.499997599976 | 22500415 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 |
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 |
| 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 |
+-----------+----------------+-------------------+-----------+
Fetched 10 row(s) in 27.17s
I am trying to speed up the query so I tried going with gzip compression.
[quickstart.cloudera:21000] > show table stats tbl_gzip_512m;
Query: show table stats tbl_gzip_512m
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 5000950 | 1 | 231.85MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 5 | 2.22GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 19 | 4.53GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 16 | 3.76GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | 526990910 | 80 | 23.84GB | 0B | | | | |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+---------------------------------------------------------
Size has dropped by 30% , it has reduced from 34gb to 24 gb.
I was expecting similar behaviour with query performance as well hoping it will reduce some more time.
Below is my query query-
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-08 18:39:09 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8d4bdddc207517ff:952aea9d00000000
+-----------+----------------+-------------------+-----------+
| source_ip | destination_ip | avg(volume) | avl |
+-----------+----------------+-------------------+-----------+
| 10.0.0.1 | 20.0.0.1 | 4.50001219768244 | 22504983 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 |
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 |
| 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 |
+-----------+----------------+-------------------+-----------+
Fetched 10 row(s) in 27.67s
As you can see there isn't any improvement moreover it looks little slow despite decrease in data by 1/3rd.
So am I doing something wrong here when using gzip ?
Thanks
... View more
Labels:
07-03-2019
07:59 AM
Thanks @AcharkiMed I tried that there was no improvement however after enabling hyper threading I was able to reduce it to 25sec from 40. I tried my hands on HDFS cache however even after defining cache_pool size to 3gb only 1 gb data gets cached, Any idea ? Query: show table stats tbl_parq_123
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| 1990 | -1 | 2 | 338.45MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 |
| 1993 | -1 | 6 | 1.32GB | 0B | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 |
| 1994 | -1 | 6 | 1.32GB | 1010.95MB | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 |
| 1995 | -1 | 14 | 3.24GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 |
| 1996 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 |
| 1997 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 |
| 1998 | -1 | 27 | 6.60GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 |
| 1999 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 |
| 2000 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 |
| 2001 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 |
| 2002 | -1 | 23 | 5.48GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 |
| Total | -1 | 148 | 34.79GB | 1010.95MB | | | | |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
F [root@quickstart ~]# hdfs cacheadmin -listPools
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL
three_gig_pool impala hdfs rwxr-xr-x 3000000000 never Thanks
... View more
07-01-2019
07:02 AM
Hello, After many days of POC and some tuning I am getting some better results however the performance seems less then what I was expecting. I understand the need of better hardware as Impala is meant for bigdata however I also need to know its performance on some regular systems. For this I created a parquet partitioned table and filled it with 500million data, I feel query is taking twice the time it should take. I had set PARQUET_FILE_SIZE = 512MB Details of my table - tbl_parq_512m +------------------------------+-------------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+-------------------------------------------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| protocol_number | int | NULL |
| source_port | int | NULL |
| destination_port | int | NULL |
| packet | int | NULL |
| volume | int | NULL |
| duration | int | NULL |
| pps | int | NULL |
| bps | int | NULL |
| bpp | int | NULL |
| source_latitude | string | NULL |
| source_longitude | string | NULL |
| source_city | string | NULL |
| source_country | string | NULL |
| destination_latitude | string | NULL |
| destination_longitude | string | NULL |
| destination_city | string | NULL |
| destination_country | string | NULL |
| ingress_volume | int | NULL |
| egress_volume | int | NULL |
| ingress_packet | int | NULL |
| egress_packet | int | NULL |
| source_if_index | int | NULL |
| destination_if_index | int | NULL |
| source_host | string | NULL |
| flow | int | NULL |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| source_ip | string | NULL |
| destination_ip | string | NULL |
| year | int | NULL |
| event_date | string | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | root | NULL |
| CreateTime: | Fri Jun 28 17:08:24 IST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | STATS_GENERATED_VIA_STATS_TASK | true |
| | transient_lastDdlTime | 1561734077 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+-------------------------------------------------------------------+----------------------+ My data using table stats- [quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | -1 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | -1 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | -1 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | -1 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | -1 | 74 | 34.74GB | 0B | | | | |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ Block size details - [root@quickstart home]# hdfs fsck -blocks /user/hive/warehouse/tbl_parq_512m;
Connecting to namenode via http://quickstart.cloudera:50070/fsck?ugi=root&blocks=1&path=%2Fuser%2Fhive%2Fwarehouse%2Ftbl_parq_512m
FSCK started by root (auth:SIMPLE) from /172.16.8.177 for path /user/hive/warehouse/tbl_parq_512m at Tue Jul 02 00:47:19 IST 2019
..........................................................................Status: HEALTHY
Total size: 37305601105 B
Total dirs: 46
Total files: 74
Total symlinks: 0
Total blocks (validated): 74 (avg. block size 504129744 B)
Minimally replicated blocks: 74 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Tue Jul 02 00:47:19 IST 2019 in 15 milliseconds
The filesystem under path '/user/hive/warehouse/tbl_parq_512m' is HEALTHY My query and result- [quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-01 18:24:10 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=fd40c639c5e21365:3cbcfc1400000000
+-----------+----------------+-------------------+
| source_ip | destination_ip | avl |
+-----------+----------------+-------------------+
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 |
| 10.0.0.1 | 20.0.0.1 | 4.499997599976 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 |
| 61.0.0.1 | 70.0.0.1 | 4.500189103782076 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 |
+-----------+----------------+-------------------+
Fetched 10 row(s) in 34.70s As we can see its takes on average 34 seconds however I was expecting it to be quicker then this. I am using the latest VM shipped and Cloudera Express. I have increased my RAM to 10gb and harddisk to 250gb. What more tunning can be done to improve performance, Do I need to tune CentOS ? Thanks
... View more
Labels:
- Labels:
-
Apache Impala
-
Cloudera Manager
06-20-2019
12:36 AM
Hi @Andrew_Sherman I have accepted this answer but yet one thing bothers me addbatch() is supported so what is the use of addbatch() if I cannot execute it. Thanks
... View more
06-18-2019
12:27 AM
I wrote a java code to insert few thousand records by creating batch, I was not getting good performance from PreparedStatements so I switched to Statements and see if there is some performance difference. However after getting error and reading Impala doc I came to know executebatch() is not supported for statements but I am able to use addBatch() so its kind a strange. Here I am posting some of my code- private static Connection connectViaDS() throws Exception {
Connection connection = null;
Class.forName("com.cloudera.impala.jdbc41.DataSource");
connection = DriverManager.getConnection(CONNECTION_URL);
return connection;
}
//inside main--
// I have declared all the variables,I have lots of other data as well so it may create more confusion
try {
connection = connectViaDS();
connection.setAutoCommit(false);
Statement statement = connection.createStatement();
Timestamp ed = new Timestamp(e_date);
Timestamp et = new Timestamp(e_time);
while(counter <total) {
for (int index = 1; index <= records; index++) {
counter++;
String stmt_query = "INSERT INTO tbl_mindarray (source_ip,destination_ip,protocol_number" + index +"," +
"source_port" + index + ",destination_port" + index + ",packet,volume" + index + ",duration" + index + "," +
"pps" + index + ",bps" + index + ",bpp" + index + ",source_latitude,source_longitude," +
"source_city,source_country,destination_latitude,destination_longitude ,destination_city ,destination_country ," +
"ingress_volume" + index + " ,egress_volume " + index + ",ingress_packet ,egress_packet " + index + ",source_if_index ,destination_if_index," +
"source_host,event_date,event_time,_time,flow,year)" + "VALUES ('s_ip','d_ip',234,23,56,556,34,27,46,389,76," +
"'123123d','456456c','Damouli','Nepal','234234d','678678c','Mumbai','India',123,456,786,324,544,23,'192.168.1.44'," +
"345" + index +",345" + index +",2010" + index +")";
statement.addBatch(stmt_query);
}
statement.executeBatch();
statement.clearBatch();
//connection.commit();
System.out.println("Counter = "+counter);
//flag++;
} My question is how do I executebatch() is its not supported, addbatch() is supported so there must be something to execute it. Here is the error which I am getting java.sql.SQLException: Error message not found: NOT_IMPLEMENTED. Can't find resource for bundle java.util.PropertyResourceBundle, key NOT_IMPLEMENTED
at com.cloudera.impala.hivecommon.dataengine.HiveJDBCDataEngine.prepareBatch(Unknown Source)
at com.cloudera.impala.jdbc.common.SStatement.executeAnyBatch(Unknown Source)
at com.cloudera.impala.jdbc.common.SStatement.executeBatch(Unknown Source)
at impala_crt.writeInABatchWithCompiledQuery(impala_crt.java:278)
Caused by: com.cloudera.impala.support.exceptions.GeneralException: Error message not found: NOT_IMPLEMENTED. Can't find resource for bundle java.util.PropertyResourceBundle, key NOT_IMPLEMENTED
... 4 more Somewhere I read each prepared statement creates a file into hdfs so I believe statement should be faster as I can insert all row values using just one statement. It would be really helpful if someone can share there knowledge on this. Thanks
... View more
Labels:
- Labels:
-
Apache Impala
-
Cloudera Manager