07-08-2019
06:13 AM
Hello,
I need to improve the speed of my SELECT query; currently I am testing the best performance I can get from Impala on the given hardware.
I have created a Parquet table with partitions and a block size of 512 MB.
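For reference, the table was created along these lines (a sketch: the full column list matches the schema in my 07-01 post further down this thread, and PARQUET_FILE_SIZE has to be set in the session before the data is written):

set PARQUET_FILE_SIZE=512m;

create table tbl_parq_512m (
  protocol_number int,
  source_port int,
  destination_port int,
  volume int,
  duration int
  -- ...remaining data columns as in the full schema...
)
partitioned by (source_ip string, destination_ip string, year int, event_date string)
stored as parquet;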
Here are some more details:
[quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 4999950 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | 526989910 | 74 | 34.74GB | 0B | | | | |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
When I run the query below, it takes about 28 seconds to complete:
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-08 18:32:31 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=88422271f161227b:fe860ef500000000
+-----------+----------------+-------------------+-----------+
| source_ip | destination_ip | avg(volume) | avl |
+-----------+----------------+-------------------+-----------+
| 10.0.0.1 | 20.0.0.1 | 4.499997599976 | 22500415 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 |
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 |
| 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 |
+-----------+----------------+-------------------+-----------+
Fetched 10 row(s) in 27.17s
To speed up the query, I tried gzip compression.
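For context, the gzip table was produced roughly like this (a sketch with the column list elided; in a dynamic partition insert the partition columns must come last in the SELECT list):

set COMPRESSION_CODEC=gzip;   -- the default codec for Parquet is snappy
set PARQUET_FILE_SIZE=512m;

insert overwrite tbl_gzip_512m partition (source_ip, destination_ip, year, event_date)
select protocol_number, /* ...remaining data columns... */ source_ip, destination_ip, year, event_date
from tbl_parq_512m;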
[quickstart.cloudera:21000] > show table stats tbl_gzip_512m;
Query: show table stats tbl_gzip_512m
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 5000950 | 1 | 231.85MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 5 | 2.22GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 19 | 4.53GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 16 | 3.76GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | 526990910 | 80 | 23.84GB | 0B | | | | |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
The size has dropped by about 30%, from 34.74 GB to 23.84 GB.
I was expecting a similar improvement in query performance, hoping it would shave off some more time.
Below is my query:
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-08 18:39:09 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8d4bdddc207517ff:952aea9d00000000
+-----------+----------------+-------------------+-----------+
| source_ip | destination_ip | avg(volume) | avl |
+-----------+----------------+-------------------+-----------+
| 10.0.0.1 | 20.0.0.1 | 4.50001219768244 | 22504983 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 |
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 |
| 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 |
+-----------+----------------+-------------------+-----------+
Fetched 10 row(s) in 27.67s
As you can see, there isn't any improvement; if anything it looks slightly slower, despite the data shrinking by about a third.
So am I doing something wrong here when using gzip?
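To figure out where the time actually goes, my next step is to look at the per-operator breakdown; impala-shell provides these commands right after a query finishes (shown here without their lengthy output):

[quickstart.cloudera:21000] > summary;   -- per-operator times for the last query
[quickstart.cloudera:21000] > profile;   -- full runtime profile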
Thanks
07-03-2019
07:59 AM
Thanks @AcharkiMed, I tried that but there was no improvement; however, after enabling hyper-threading I was able to reduce the time from 40 seconds to 25. I also tried my hand at HDFS caching, but even after defining a cache pool of 3 GB, only about 1 GB of data gets cached. Any idea why?

Query: show table stats tbl_parq_123
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| 1990 | -1 | 2 | 338.45MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 |
| 1993 | -1 | 6 | 1.32GB | 0B | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 |
| 1994 | -1 | 6 | 1.32GB | 1010.95MB | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 |
| 1995 | -1 | 14 | 3.24GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 |
| 1996 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 |
| 1997 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 |
| 1998 | -1 | 27 | 6.60GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 |
| 1999 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 |
| 2000 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 |
| 2001 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 |
| 2002 | -1 | 23 | 5.48GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 |
| Total | -1 | 148 | 34.79GB | 1010.95MB | | | | |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
[root@quickstart ~]# hdfs cacheadmin -listPools
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL
three_gig_pool impala hdfs rwxr-xr-x 3000000000 never
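For reference, the pool and the per-partition caching were set up roughly like this (a sketch; pool name and limit as shown above). One thing I still need to check is dfs.datanode.max.locked.memory, since as I understand it that setting caps how many bytes each DataNode can pin in memory regardless of the pool limit:

hdfs cacheadmin -addPool three_gig_pool -owner impala -limit 3000000000

[quickstart.cloudera:21000] > alter table tbl_parq_123 partition (year=1994) set cached in 'three_gig_pool';

Thanks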
07-01-2019
07:02 AM
Hello,

After many days of POC and some tuning I am getting better results, but the performance is still less than I was expecting. I understand the need for better hardware, as Impala is meant for big data, but I also need to know how it performs on ordinary systems. For this I created a partitioned Parquet table and filled it with 500 million rows; I feel the query is taking twice as long as it should. I had set PARQUET_FILE_SIZE = 512MB.

Details of my table, tbl_parq_512m:

+------------------------------+-------------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+-------------------------------------------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| protocol_number | int | NULL |
| source_port | int | NULL |
| destination_port | int | NULL |
| packet | int | NULL |
| volume | int | NULL |
| duration | int | NULL |
| pps | int | NULL |
| bps | int | NULL |
| bpp | int | NULL |
| source_latitude | string | NULL |
| source_longitude | string | NULL |
| source_city | string | NULL |
| source_country | string | NULL |
| destination_latitude | string | NULL |
| destination_longitude | string | NULL |
| destination_city | string | NULL |
| destination_country | string | NULL |
| ingress_volume | int | NULL |
| egress_volume | int | NULL |
| ingress_packet | int | NULL |
| egress_packet | int | NULL |
| source_if_index | int | NULL |
| destination_if_index | int | NULL |
| source_host | string | NULL |
| flow | int | NULL |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| source_ip | string | NULL |
| destination_ip | string | NULL |
| year | int | NULL |
| event_date | string | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | root | NULL |
| CreateTime: | Fri Jun 28 17:08:24 IST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | STATS_GENERATED_VIA_STATS_TASK | true |
| | transient_lastDdlTime | 1561734077 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+-------------------------------------------------------------------+----------------------+ My data using table stats- [quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | -1 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | -1 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | -1 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | -1 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | -1 | 74 | 34.74GB | 0B | | | | |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+

Block size details:

[root@quickstart home]# hdfs fsck -blocks /user/hive/warehouse/tbl_parq_512m;
Connecting to namenode via http://quickstart.cloudera:50070/fsck?ugi=root&blocks=1&path=%2Fuser%2Fhive%2Fwarehouse%2Ftbl_parq_512m
FSCK started by root (auth:SIMPLE) from /172.16.8.177 for path /user/hive/warehouse/tbl_parq_512m at Tue Jul 02 00:47:19 IST 2019
..........................................................................Status: HEALTHY
Total size: 37305601105 B
Total dirs: 46
Total files: 74
Total symlinks: 0
Total blocks (validated): 74 (avg. block size 504129744 B)
Minimally replicated blocks: 74 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Tue Jul 02 00:47:19 IST 2019 in 15 milliseconds
The filesystem under path '/user/hive/warehouse/tbl_parq_512m' is HEALTHY

My query and result:

[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-01 18:24:10 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=fd40c639c5e21365:3cbcfc1400000000
+-----------+----------------+-------------------+
| source_ip | destination_ip | avl |
+-----------+----------------+-------------------+
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 |
| 10.0.0.1 | 20.0.0.1 | 4.499997599976 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 |
| 61.0.0.1 | 70.0.0.1 | 4.500189103782076 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 |
+-----------+----------------+-------------------+
Fetched 10 row(s) in 34.70s

As you can see, it takes around 34 seconds on average, but I was expecting it to be quicker than this. I am using the latest VM shipped with Cloudera Express, and I have increased my RAM to 10 GB and my hard disk to 250 GB. What more tuning can be done to improve performance? Do I need to tune CentOS?
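One thing I notice in the stats above is #Rows = -1 everywhere, meaning table statistics were never computed; I am going to compute them before the next round of tests (the standard command, shown here for completeness):

[quickstart.cloudera:21000] > compute stats tbl_parq_512m;

Thanks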
Labels: Apache Impala, Cloudera Manager
06-26-2019
08:26 AM
Hello,

Finally I was able to insert values into the partitioned table, after much reading into the concepts and some trial and error. However, loading from a file is still not working. I also learned along the way that when we specify a partition value in the query, it is actually treated as a column:

insert into tbl_raw_v12 partition(source_ip="192.168.1.10",destination_ip="172.16.8.177",year,event_date) select * from tbl_123;

I managed to add 50k rows this way. tbl_raw_v12 contains 29 columns and tbl_123 contains 27 columns, and the remaining two column values are supplied by the INSERT statement at runtime. I will try with more values tomorrow to check performance.
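For anyone following along, the general shape of such a mixed static/dynamic partition insert is below (a sketch with the column list elided; as far as I understand it, the dynamically partitioned columns must come last in the SELECT list, in the same order as in the PARTITION clause):

insert into tbl_raw_v12
  partition (source_ip="192.168.1.10", destination_ip="172.16.8.177", year, event_date)
select
  protocol_number, /* ...the other data columns from tbl_123... */
  year, event_date   -- dynamic partition keys last, in PARTITION-clause order
from tbl_123;

Thanks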
06-26-2019
06:36 AM
Hi @EricL,

I tried it for a very small table like you showed and it worked, but even for text files containing roughly 50k rows I was getting an error, so this method seems unsuitable for POC or production use. Copying a mere 50k records that I had already loaded into another table threw a memory error:

[quickstart.cloudera:21000] > insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata;
Query: insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata
Query submitted at: 2019-06-26 18:43:21 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=584e2c3facce84f0:a95b21f300000000
WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000
Memory left in process limit: -4.25 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB
HdfsTableSink: Total=3.94 GB Peak=3.94 GB
CodeGen: Total=6.60 KB Peak=690.00 KBProcess: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB
Buffer Pool: Free Buffers: Total=0
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=0
RequestPool=root.root: Total=3.96 GB Peak=4.12 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
Untracked Memory: Total=3.29 GB
WARNING: The following tables are missing relevant table and/or column statistics.
default.tbl_motadata
Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000
Memory left in process limit: -4.25 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB
HdfsTableSink: Total=3.94 GB Peak=3.94 GB
CodeGen: Total=6.60 KB Peak=690.00 KBProcess: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB
Buffer Pool: Free Buffers: Total=0
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=0
RequestPool=root.root: Total=3.96 GB Peak=4.12 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
Untracked Memory: Total=3.29 GB (1 of 2 similar)
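The warning in the middle of the output points at missing statistics on the source table; the Impala docs also suggest a [SHUFFLE] hint for partitioned inserts, so that each partition is written by a single node, which reduces HdfsTableSink memory. For reference, both look like this (I have not yet verified whether they fix my case):

[quickstart.cloudera:21000] > compute stats tbl_motadata;
[quickstart.cloudera:21000] > insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) [shuffle] select * from tbl_motadata;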
Inserting into non-partitioned tables doesn't throw this error. Yesterday I copied some 500 million records using "INSERT INTO table1 SELECT * FROM table2" on the same machine; it took 4 hours but the task completed, yet here I am getting an error for a mere 50k records.

I am actually testing Impala's performance for our production system needs, but I believe we need to tune it extensively to get there; even SELECT queries are taking too much time. One query took 46 minutes to execute, while the same query in ClickHouse took only 2 minutes on the same hardware configuration. We were actually expecting results in a few seconds, as we thought Impala would be much faster than ClickHouse.

I don't know why "INSERT INTO ... [PARTITION] SELECT *" throws an error while it works fine without a partition. Is there any other method to check? I am testing against 500 million records.

Thanks
06-26-2019
01:17 AM
Hi @EricL,

I think the issue here is something different; the above steps are not working, nor is anything else I tried from the internet. I dropped the table and created it again, but it throws the same error:

ERROR: AnalysisException: Partition spec does not exist: (year=2009).

My create query:

create table tbl_id(id int) partitioned by(year int);

Here are the table's full details, using desc formatted tbl_part:

+------------------------------+--------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+--------------------------------------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| id | int | NULL |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| year | int | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | root | NULL |
| CreateTime: | Wed Jun 26 13:33:02 IST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_part | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1561536182 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+--------------------------------------------------------------+----------------------+

I tried with two files, one of which doesn't contain the partition column value:

2009.txt
10
20
30
40

This one contains the partition column values:

2010.txt
5,2010
1,2010
3,2010

I used the same LOAD query for both, just with different partition values. I also created the same table specifying "row format delimited fields terminated by ','". This is a very simple thing and it should have worked; I don't know whether there is some configuration issue or something else unknown. Do you see anything wrong in the above steps? Meanwhile, I will try the temp table method to see how that works.

Thanks
06-25-2019
11:15 PM
Hi @EricL,

The dynamic part is not that important; my concern is how I should upload the data into a partitioned table using LOAD. I tried the regular method directly:

[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part;

But it didn't work; my text file id.txt contained both the regular column and the partition column data. So I tried specifying the partition value in the query:

load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010);

But this is also not working. So how does the statement below work, on what kind of table, and what structure of data file does it need?

LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

I have created a basic partitioned table using the query below:

create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ',';

I am able to use LOAD with regular tables, but with partitioned tables I am getting errors. What should the data file that I load into this table look like, and what should my LOAD statement be so that I can load data into the partitioned table?

Edit: I have also tried following this Hive example, but I am getting the same error mentioned above: https://bigdataprogrammers.com/load-files-into-hive-partitioned-table/
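From what I have read since, one detail worth noting (I still need to confirm it end to end): in Impala, LOAD DATA into a static partition requires that partition to already exist, and the data file should then contain only the non-partition columns. Something like:

[quickstart.cloudera:21000] > alter table tbl_id_part add partition (year=2010);
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010);

Thanks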
06-25-2019
10:40 AM
Hello,

I am trying to add data into my table using the LOAD DATA method. I was getting an error, so to reduce the complexity and understand the concept better I simplified my table, but I am still getting the same error with both methods I tried, and there aren't many examples available.

My create query:

create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ',';

I tried two methods:

1. I created a text file id.txt with the data:

2,2009
10,2009
15,2010
34,2010

My LOAD query:

[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010);

When I executed the above query I got:

Query: load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010)
ERROR: AnalysisException: Partition spec does not exist: (year=2010).

2. I thought it must be throwing the error because id.txt contains the year data, so I tried a different file, id1.txt:

5
10
20
30

My load query:

[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010);

I am getting the same error:

Query: load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010)
ERROR: AnalysisException: Partition spec does not exist: (year=2010).

My question is: what should my data text file look like? Should the partition column values be inside the file or not? I am quite confused; please guide.

Thanks
Labels: Apache Impala, Cloudera Manager
06-24-2019
09:38 PM
I will be getting data from the network via SNMP or serial. Loading data from a file is very fast; I did it manually first to check, and it worked, but when I try to do the same from my Java code I get various errors. The steps I am attempting are:

1. Write 1 million records to a text file from Java.
2. Upload the generated text file into HDFS from my Java code.
3. Execute the LOAD query from code.

I am unable to get past step 2. I tried the Java ProcessBuilder() and getRuntime().exec() approaches, but they didn't work. Then I tried looking for an API to do the task, but I got this error:

java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at HdfsWriter.run(impala_crt.java:935)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at impala_crt.writeFile(impala_crt.java:861)
at impala_crt.main(impala_crt.java:106) Do you have any solution for this or there is some better method ?
06-20-2019
12:36 AM
Hi @Andrew_Sherman, I have accepted this answer, but one thing still bothers me: addBatch() is supported, so what is the use of addBatch() if I cannot execute it?

Thanks