Member since: 06-17-2019
Posts: 23
Kudos Received: 0
Solutions: 0
07-15-2019
04:38 AM
Hello, I am checking Impala performance on multiple systems with different hardware configurations. While testing on an 8-core system I see only one core reach 100% CPU while the remaining cores stay idle when executing a SELECT query, so I am not able to use the entire hardware. How can I make all cores work for Impala? I used htop to check CPU usage. Currently my cluster is a single node with 12 GB RAM and an 8-core CPU, and I am using the latest Cloudera VM. I checked the Impala and HDFS configs but found only Cgroup CPU Shares = 1024. Is there some parameter which defines the number of cores to be used? Thanks punshi
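For what it's worth, per-query scan parallelism can be checked and capped from impala-shell with standard query options; the following is only a minimal sketch (NUM_SCANNER_THREADS is a documented option, whether MT_DOP is accepted depends on the Impala release and statement type, and the table name is just the one used later in these posts):
-- 0 (the default) lets Impala decide; a positive value caps scanner threads per node
set num_scanner_threads=8;
-- multi-threaded execution; availability depends on the release and statement type
set mt_dop=4;
select count(*) from tbl_parq_512m;
summary;
The SUMMARY command in impala-shell then shows per-operator times, which makes it easier to see whether the query is actually scan-bound on the single node.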
Labels:
- Apache Impala
- Cloudera Manager
07-08-2019
06:13 AM
Hello,
I need to improve the speed of my SELECT query; currently I am testing the best performance I can get from Impala with the given hardware.
I have created a Parquet table with partitions and a block size of 512 MB (a sketch of the approach follows).
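For context, the table would have been produced roughly along these lines; this is only a sketch with a cut-down column list, and tbl_parq_512m_example and source_table are illustrative names (the real DDL is not shown in the post):
set PARQUET_FILE_SIZE=512m;
create table tbl_parq_512m_example (volume int, duration int)
partitioned by (source_ip string, destination_ip string, year int, event_date string)
stored as parquet;
insert into tbl_parq_512m_example partition (source_ip, destination_ip, year, event_date)
select volume, duration, source_ip, destination_ip, year, event_date from source_table;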
Here are some more details:
[quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 4999950 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | 526989910 | 74 | 34.74GB | 0B | | | | |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+---------------------------------------------------------
When I run the query below, it takes 28 seconds to complete:
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-08 18:32:31 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=88422271f161227b:fe860ef500000000
+-----------+----------------+-------------------+-----------+
| source_ip | destination_ip | avg(volume) | avl |
+-----------+----------------+-------------------+-----------+
| 10.0.0.1 | 20.0.0.1 | 4.499997599976 | 22500415 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 |
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 |
| 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 |
+-----------+----------------+-------------------+-----------+
Fetched 10 row(s) in 27.17s
To speed up the query I also tried gzip compression (sketched below).
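For reference, the gzip copy can be built by switching the session codec before writing; a sketch only (COMPRESSION_CODEC and PARQUET_FILE_SIZE are standard query options, and the default codec is Snappy):
set COMPRESSION_CODEC=gzip;
set PARQUET_FILE_SIZE=512m;
create table tbl_gzip_512m like tbl_parq_512m stored as parquet;
insert into tbl_gzip_512m partition (source_ip, destination_ip, year, event_date)
select * from tbl_parq_512m;
Note that gzip trades smaller files for extra decompression CPU, so on a CPU-bound single node the smaller size does not necessarily translate into a faster query.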
[quickstart.cloudera:21000] > show table stats tbl_gzip_512m;
Query: show table stats tbl_gzip_512m
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 5000950 | 1 | 231.85MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 5 | 2.22GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 19 | 4.53GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 16 | 3.76GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | 526990910 | 80 | 23.84GB | 0B | | | | |
+-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+---------------------------------------------------------
The size has dropped by about 30%, from 34 GB to 24 GB.
I was expecting a similar improvement in query performance, hoping it would shave off some more time.
Below is my query:
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-08 18:39:09 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8d4bdddc207517ff:952aea9d00000000
+-----------+----------------+-------------------+-----------+
| source_ip | destination_ip | avg(volume) | avl |
+-----------+----------------+-------------------+-----------+
| 10.0.0.1 | 20.0.0.1 | 4.50001219768244 | 22504983 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 |
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 |
| 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 |
+-----------+----------------+-------------------+-----------+
Fetched 10 row(s) in 27.67s
As you can see there isn't any improvement; in fact it looks slightly slower, despite the data shrinking by roughly a third.
So am I doing something wrong here when using gzip?
Thanks
07-04-2019
05:40 AM
Thanks @AcharkiMed. I am still wondering: if I am providing a 3 GB cache pool using hdfs cacheadmin, why does it allocate only 1 GB? Is it to do with the RAM size? I was actually thinking the HDFS cache uses space from my hard disk, not RAM.
07-03-2019
11:55 PM
Thanks @AcharkiMed. That command really shows a clear picture now, but I still don't know why the value is set to 1 GB (Configured Cache Capacity = 1 GB).
[root@quickstart ~]# sudo -u hdfs hdfs dfsadmin -report
Configured Capacity: 250717949952 (233.50 GB)
Present Capacity: 208179929088 (193.88 GB)
DFS Remaining: 88135348224 (82.08 GB)
DFS Used: 120044580864 (111.80 GB)
DFS Used%: 57.66%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 172.16.8.177:50010 (quickstart.cloudera)
Hostname: quickstart.cloudera
Rack: /default
Decommission Status : Normal
Configured Capacity: 250717949952 (233.50 GB)
DFS Used: 120044580864 (111.80 GB)
Non DFS Used: 29259272192 (27.25 GB)
DFS Remaining: 88135348224 (82.08 GB)
DFS Used%: 47.88%
DFS Remaining%: 35.15%
Configured Cache Capacity: 1073741824 (1 GB)
Cache Used: 1060061184 (1010.95 MB)
Cache Remaining: 13680640 (13.05 MB)
Cache Used%: 98.73%
Cache Remaining%: 1.27%
Xceivers: 2
Last contact: Thu Jul 04 12:04:21 IST 2019
I checked hdfs-default.xml for a parameter defining this value but couldn't find one. I saw one parameter, dfs.datanode.max.locked.memory=0, but I feel it's different. Is it automatic and dependent on RAM, or can I configure it? Thanks
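For reference, the effective value of that property on the node can be read directly with a standard command; a small sketch (whether this property is what produces the 1 GB figure here is an assumption to verify):
hdfs getconf -confKey dfs.datanode.max.locked.memory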
07-03-2019
08:19 AM
Hello,
I am trying to use the HDFS cache to get some performance improvement; however, I am unable to use the full cache pool I defined.
I tried the REFRESH table statement, and I tried creating a new table with caching defined from the start, but none of it worked. I also removed the cache and assigned it again, with no difference.
My table stats:
Query: show table stats tbl_parq_123
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| 1990 | -1 | 2 | 338.45MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 |
| 1993 | -1 | 6 | 1.32GB | 0B | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 |
| 1994 | -1 | 6 | 1.32GB | 1010.95MB | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 |
| 1995 | -1 | 14 | 3.24GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 |
| 1996 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 |
| 1997 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 |
| 1998 | -1 | 27 | 6.60GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 |
| 1999 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 |
| 2000 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 |
| 2001 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 |
| 2002 | -1 | 23 | 5.48GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 |
| Total | -1 | 148 | 34.79GB | 1010.95MB | | | | |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
Pool size-
[root@quickstart ~]# hdfs cacheadmin -listPools
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL
three_gig_pool impala hdfs rwxr-xr-x 3000000000 never
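For context, a pool with this limit would typically have been created along these lines; a sketch only, since the exact flags used originally are not shown in the post:
hdfs cacheadmin -addPool three_gig_pool -owner impala -group hdfs -limit 3000000000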
My method-
[quickstart.cloudera:21000] > alter table tbl_parq_123 set cached in 'three_gig_pool';
Query: alter table tbl_parq_123 set cached in 'three_gig_pool'
+---------------+
| summary |
+---------------+
| Cached table. |
+---------------+
Fetched 1 row(s) in 1.98s
Sometimes the cached data would be 500 MB or 800 MB, but it never crossed 1 GB. Is there any parameter or something else which I need to check?
Thanks
07-03-2019
07:59 AM
Thanks @AcharkiMed. I tried that and there was no improvement; however, after enabling hyper-threading I was able to reduce it from 40 seconds to 25. I also tried my hand at the HDFS cache, but even after defining a cache pool size of 3 GB, only 1 GB of data gets cached. Any idea?
Query: show table stats tbl_parq_123
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| 1990 | -1 | 2 | 338.45MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 |
| 1993 | -1 | 6 | 1.32GB | 0B | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 |
| 1994 | -1 | 6 | 1.32GB | 1010.95MB | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 |
| 1995 | -1 | 14 | 3.24GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 |
| 1996 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 |
| 1997 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 |
| 1998 | -1 | 27 | 6.60GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 |
| 1999 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 |
| 2000 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 |
| 2001 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 |
| 2002 | -1 | 23 | 5.48GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 |
| Total | -1 | 148 | 34.79GB | 1010.95MB | | | | |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
[root@quickstart ~]# hdfs cacheadmin -listPools
Found 1 result.
NAME OWNER GROUP MODE LIMIT MAXTTL
three_gig_pool impala hdfs rwxr-xr-x 3000000000 never
Thanks
07-01-2019
07:02 AM
Hello,
After many days of POC and some tuning I am getting better results; however, the performance still seems less than what I was expecting. I understand the need for better hardware, as Impala is meant for big data, but I also need to know its performance on regular systems. For this I created a partitioned Parquet table and filled it with 500 million rows, and I feel the query is taking twice the time it should. I had set PARQUET_FILE_SIZE = 512MB.
Details of my table - tbl_parq_512m:
+------------------------------+-------------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+-------------------------------------------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| protocol_number | int | NULL |
| source_port | int | NULL |
| destination_port | int | NULL |
| packet | int | NULL |
| volume | int | NULL |
| duration | int | NULL |
| pps | int | NULL |
| bps | int | NULL |
| bpp | int | NULL |
| source_latitude | string | NULL |
| source_longitude | string | NULL |
| source_city | string | NULL |
| source_country | string | NULL |
| destination_latitude | string | NULL |
| destination_longitude | string | NULL |
| destination_city | string | NULL |
| destination_country | string | NULL |
| ingress_volume | int | NULL |
| egress_volume | int | NULL |
| ingress_packet | int | NULL |
| egress_packet | int | NULL |
| source_if_index | int | NULL |
| destination_if_index | int | NULL |
| source_host | string | NULL |
| flow | int | NULL |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| source_ip | string | NULL |
| destination_ip | string | NULL |
| year | int | NULL |
| event_date | string | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | root | NULL |
| CreateTime: | Fri Jun 28 17:08:24 IST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | STATS_GENERATED_VIA_STATS_TASK | true |
| | transient_lastDdlTime | 1561734077 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+-------------------------------------------------------------------+----------------------+
My data using table stats:
[quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | -1 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | -1 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | -1 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | -1 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | -1 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | -1 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total | | | | -1 | 74 | 34.74GB | 0B | | | | |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
Block size details:
[root@quickstart home]# hdfs fsck -blocks /user/hive/warehouse/tbl_parq_512m;
Connecting to namenode via http://quickstart.cloudera:50070/fsck?ugi=root&blocks=1&path=%2Fuser%2Fhive%2Fwarehouse%2Ftbl_parq_512m
FSCK started by root (auth:SIMPLE) from /172.16.8.177 for path /user/hive/warehouse/tbl_parq_512m at Tue Jul 02 00:47:19 IST 2019
..........................................................................Status: HEALTHY
Total size: 37305601105 B
Total dirs: 46
Total files: 74
Total symlinks: 0
Total blocks (validated): 74 (avg. block size 504129744 B)
Minimally replicated blocks: 74 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Tue Jul 02 00:47:19 IST 2019 in 15 milliseconds
The filesystem under path '/user/hive/warehouse/tbl_parq_512m' is HEALTHY
My query and result:
[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-01 18:24:10 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=fd40c639c5e21365:3cbcfc1400000000
+-----------+----------------+-------------------+
| source_ip | destination_ip | avl |
+-----------+----------------+-------------------+
| 41.0.0.1 | 45.0.0.1 | 4.498248032480324 |
| 46.0.0.1 | 50.0.0.1 | 4.499559623845538 |
| 30.0.0.1 | 40.0.0.1 | 4.499744347443475 |
| 91.0.0.1 | 95.0.0.1 | 4.499926612990091 |
| 80.0.0.1 | 85.0.0.1 | 4.499946078921578 |
| 10.0.0.1 | 20.0.0.1 | 4.499997599976 |
| 51.0.0.1 | 55.0.0.1 | 4.500039240784815 |
| 61.0.0.1 | 70.0.0.1 | 4.500189103782076 |
| 86.0.0.1 | 90.0.0.1 | 4.500191183823676 |
| 71.0.0.1 | 75.0.0.1 | 4.500268505370108 |
+-----------+----------------+-------------------+
Fetched 10 row(s) in 34.70s
As you can see it takes around 34 seconds, but I was expecting it to be quicker than this. I am using the latest VM shipped with Cloudera Express. I have increased my RAM to 10 GB and the hard disk to 250 GB. What more tuning can be done to improve performance? Do I need to tune CentOS? Thanks
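One detail worth noting in the stats output above: #Rows is -1, i.e. table statistics are missing. A small sketch of gathering stats, re-running the query, and then using the impala-shell SUMMARY command to see where the time goes (both statements are standard; how much this helps a scan-bound single-node query is not guaranteed):
compute stats tbl_parq_512m;
select source_ip, destination_ip, avg(volume) as avl
from tbl_parq_512m
group by source_ip, destination_ip
order by avl
limit 10;
summary;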
Labels:
- Apache Impala
- Cloudera Manager
06-26-2019
08:26 AM
Hello,
Finally I was able to add values into the partitioned table, after much reading into the concepts and some trial & error. However, loading from a file is still not working, and along the way I learned that when we specify a partition value in the query it is effectively treated as a column (see the sketch below).
insert into tbl_raw_v12 partition(source_ip="192.168.1.10",destination_ip="172.16.8.177",year,event_date) select * from tbl_123;
I managed to add 50k rows; tbl_raw_v12 contains 29 columns and tbl_123 contains 27, and the remaining two column values are supplied by the insert query at runtime. I will try with more values tomorrow to check performance. Thanks
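To spell out that point, a sketch of the mixed static/dynamic form used above (col1 and col2 stand in for the real data columns): partition keys given literal values in the PARTITION clause must not appear in the SELECT list, while dynamic partition keys such as year and event_date must come last in the SELECT list.
-- static values for source_ip/destination_ip, dynamic year/event_date (illustrative column names)
insert into tbl_raw_v12
partition (source_ip='192.168.1.10', destination_ip='172.16.8.177', year, event_date)
select col1, col2, year, event_date from tbl_123;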
06-26-2019
06:36 AM
Hi @EricL,
I tried it for a very small table like you showed and it worked, but even for text files that contain roughly 50k rows I was getting an error. This method does not seem suitable for POC purposes or production. I tried copying a mere 50k records, which I had already loaded into another table, and it threw a memory error.
[quickstart.cloudera:21000] > insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata;
Query: insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata
Query submitted at: 2019-06-26 18:43:21 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=584e2c3facce84f0:a95b21f300000000
WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000
Memory left in process limit: -4.25 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB
HdfsTableSink: Total=3.94 GB Peak=3.94 GB
CodeGen: Total=6.60 KB Peak=690.00 KB
Process: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB
Buffer Pool: Free Buffers: Total=0
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=0
RequestPool=root.root: Total=3.96 GB Peak=4.12 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
Untracked Memory: Total=3.29 GB
WARNING: The following tables are missing relevant table and/or column statistics.
default.tbl_motadata
Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000
Memory left in process limit: -4.25 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB
HdfsTableSink: Total=3.94 GB Peak=3.94 GB
CodeGen: Total=6.60 KB Peak=690.00 KB
Process: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB
Buffer Pool: Free Buffers: Total=0
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=0
RequestPool=root.root: Total=3.96 GB Peak=4.12 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
Untracked Memory: Total=3.29 GB (1 of 2 similar)
The same statement doesn't throw an error for non-partitioned tables. Yesterday I copied some 500 million records using "INSERT INTO table1 SELECT * FROM table2" on the same machine; it took 4 hours but the task completed, yet here I am getting an error for copying a mere 50k records. I am actually testing Impala performance for our production system needs, and I believe we need to tune it extensively to get better performance; even SELECT queries are taking too much time. One query took 46 minutes to execute, while the same query in ClickHouse took only 2 minutes on the same hardware configuration. We were actually expecting results in a few seconds, as we thought Impala should be way faster than ClickHouse. I don't know why "INSERT INTO ... [PARTITION] SELECT *" throws an error but without a partition it works fine. Is there any other method to check? I am actually testing against 500 million records. Thanks
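For what it's worth, the large HdfsTableSink figure in the dump above is the Parquet writer buffering data for each open partition, which is why a partitioned INSERT ... SELECT can exceed a 3 GB process limit while the non-partitioned version does not. A common workaround, sketched only (the 128m value and the year = 1990 slice are illustrative, and the WHERE clause assumes tbl_motadata carries the year column, as the dynamic partition insert implies):
-- smaller write buffer per open partition
set PARQUET_FILE_SIZE=128m;
-- write one slice of partitions per statement instead of all of them at once
insert into tbl_raw_part_v11 partition (source_ip, destination_ip, year, event_date)
select * from tbl_motadata where year = 1990;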
06-26-2019
01:17 AM
Hi @EricL,
I think the issue here is something different; the above steps are not working, nor is anything else I tried from the internet. I dropped the table and created it again but it's throwing the same error: ERROR: AnalysisException: Partition spec does not exist: (year=2009).
My create query: create table tbl_id(id int) partitioned by(year int);
Here are my table's full details using desc formatted tbl_part:
+------------------------------+--------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+--------------------------------------------------------------+----------------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| id | int | NULL |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| year | int | NULL |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | root | NULL |
| CreateTime: | Wed Jun 26 13:33:02 IST 2019 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_part | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1561536182 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+------------------------------+--------------------------------------------------------------+----------------------+
I tried with two files. This one does not contain the partition column value - 2009.txt:
10
20
30
40
This one contains the partition column values - 2010.txt:
5,2010
1,2010
3,2010
I used the same query, just with different partition values. I also created the same table specifying "row format delimited fields terminated by ','". This is a very simple thing and it should have worked; I don't know whether there is some configuration issue or something unknown. Do you see anything wrong in the above steps? Meanwhile I will try the temp table method to see how that works. Thanks
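For reference, LOAD DATA with a static partition spec expects the partition to already exist, and the file should then contain only the non-partition columns; a sketch of that sequence for the table above (the HDFS path is illustrative):
alter table tbl_id add partition (year=2009);
load data inpath '/user/impala/data/2009.txt' into table tbl_id partition (year=2009);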
06-25-2019
11:15 PM
Hi @EricL,
The dynamic part is not that important; my concern is how I should upload the data into a partitioned table using LOAD. I tried the regular method of directly using:
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part;
But it didn't work; my text file id.txt contained both the regular column and the partition column data. So I tried the method of specifying the partition value in the query:
load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010);
But this is also not working. So how does the given statement work, on what kind of table, and what structure of data file does it need?
LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
I created a basic partitioned table using the query below:
create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ',';
I am able to use LOAD with regular tables, but with partitioned tables I am getting errors. What should the data file that I load into this table look like? What should my load statement be so that I am able to load data into the partitioned table?
Edit: I have also tried following the Hive example below, but I am getting the same error mentioned above.
https://bigdataprogrammers.com/load-files-into-hive-partitioned-table/
Thanks
06-25-2019
10:40 AM
Hello,
I am trying to add data into my table using the LOAD DATA method; however, I was getting an error, so to reduce the complexity and understand the concept better I simplified my table, but I am still getting the error. I have tried two methods and both throw the same error. There aren't many examples available.
My create query: create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ',';
Now I tried two methods here:
1. I created a text file id.txt with data:
2,2009
10,2009
15,2010
34,2010
My LOAD query:
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010);
When I executed the above query I got:
Query: load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010)
ERROR: AnalysisException: Partition spec does not exist: (year=2010).
I thought it must be throwing the error because id.txt contains the year data, so I tried with a different file, id1.txt:
5
10
20
30
My load query:
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010);
I am getting the same error:
Query: load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010)
ERROR: AnalysisException: Partition spec does not exist: (year=2010).
My question is: what should my data text file look like, and should the partition column values be inside the file or not? I am pretty confused, please guide. Thanks
Labels:
- Apache Impala
- Cloudera Manager
06-24-2019
09:38 PM
I will be getting data from the network via SNMP or serial. Loading data from a file is very fast; I did it manually first to check and it worked, but when I try to do the same with my Java code I get different errors. I am attempting the following steps:
1. I wrote 1 million records into a text file through Java.
2. I tried to upload the generated text file into HDFS through my Java code.
3. I execute the LOAD query through code.
However, I am unable to complete step 2. I tried the Java methods ProcessBuilder() and getRuntime().exec(), but they didn't work. Then I tried searching for an API to do this task, but I was getting the following error:
java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at HdfsWriter.run(impala_crt.java:935)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at impala_crt.writeFile(impala_crt.java:861)
at impala_crt.main(impala_crt.java:106)
Do you have any solution for this, or is there some better method?
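In case it helps as a comparison point, here is a minimal sketch of writing a local file into HDFS with the Hadoop FileSystem API. The paths are illustrative, and the two fs.*.impl settings are a commonly used guard against the "No FileSystem for scheme" symptom when running from a repackaged (shaded) jar, which is an assumption about this setup:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020");
        // Shaded jars can drop the META-INF FileSystem service entries, one known
        // cause of "No FileSystem for scheme"; naming the implementations explicitly
        // sidesteps that (assumption for this setup).
        conf.set("fs.hdfs.impl", DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", LocalFileSystem.class.getName());
        FileSystem fs = FileSystem.get(conf);
        // Illustrative paths: the local file produced in step 1 and the HDFS
        // directory later referenced by the LOAD statement.
        fs.copyFromLocalFile(new Path("file:///tmp/records.txt"),
                             new Path("/user/impala/data/records.txt"));
        fs.close();
    }
}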
06-20-2019
12:36 AM
Hi @Andrew_Sherman, I have accepted this answer, but one thing still bothers me: addBatch() is supported, so what is the use of addBatch() if I cannot execute the batch? Thanks
06-20-2019
12:30 AM
Hi @EricL, I don't know how this will work; I will get continuous data from the network (the data is related to NMS). So you mean that when I receive data from the network, I first write it into a file, then copy that file into HDFS, and then use LOAD to load the data from that file into my table, is that right?
06-19-2019
11:33 PM
Hi, I am posting my entire Java code here.
import com.cloudera.impala.jdbc41.internal.com.cloudera.altus.shaded.org.apache.commons.lang3.RandomStringUtils;
import java.sql.*;
import java.util.Random;
class impala12 {
private static final String CONNECTION_URL = "jdbc:impala://172.16.8.177:21050;SCRATCH_LIMIT=-1" ;
private static final String sqlStatementCreate = "CREATE TABLE if not exists helloworld (message String) STORED AS PARQUET";
private static final String sqlStatementInsert = "INSERT INTO helloworld VALUES (\"helloworld\")";
private static final String sqlCompiledQuery = "INSERT INTO tbl_mindarray (source_ip,destination_ip,protocol_number," +
"source_port,destination_port,packet,volume,duration,pps,bps,bpp,source_latitude,source_longitude," +
"source_city,source_country,destination_latitude,destination_longitude ,destination_city ,destination_country ," +
"ingress_volume ,egress_volume ,ingress_packet ,egress_packet ,source_if_index ,destination_if_index," +
"source_host,event_date,event_time,_time,flow,year)" + " VALUES" +
"(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
;
public static void main(String[] args) {
System.out.println("Impala using Java");
writeInABatchWithCompiledQuery(1000);
System.out.println("Done");
}
private static Connection connectViaDS() throws Exception {
Connection connection = null;
Class.forName("com.cloudera.impala.jdbc41.Driver");
connection = DriverManager.getConnection(CONNECTION_URL);
return connection;
}
private static void writeInABatchWithCompiledQuery(int records) {
long e_date= 1275822966, e_time= 1370517366;
PreparedStatement preparedStatement;
// int total = 1000000*1000;
int total = 1000*1000;
int counter =0;
Connection connection = null;
try {
connection = connectViaDS();
connection.setAutoCommit(false);
preparedStatement = connection.prepareStatement(sqlCompiledQuery);
Timestamp ed = new Timestamp(e_date);
Timestamp et = new Timestamp(e_time);
while(counter <total) {
for (int index = 1; index <= records; index++) {
counter++;
preparedStatement.setString(1, ranStr());
// System.out.println(1);
preparedStatement.setString(2, ranStr());
// System.out.println(2);
preparedStatement.setInt(3, ranInt(1,10));
// System.out.println(3);
preparedStatement.setInt(4, ranInt(1,10));
// System.out.println(4);
preparedStatement.setInt(5, ranInt(1,10));
// System.out.println(5);
preparedStatement.setInt(6, ranInt(1,10));
// System.out.println(6);
preparedStatement.setInt(7, ranInt(1,10));
// System.out.println(7);
preparedStatement.setInt(8, ranInt(1,10));
// System.out.println(8);
preparedStatement.setInt(9, ranInt(1,10));
// System.out.println(9);
preparedStatement.setInt(10, ranInt(1,10));
// System.out.println(10);
preparedStatement.setInt(11, ranInt(1,10));
// System.out.println(11);
preparedStatement.setString(12, ranStr());
// System.out.println(12);
preparedStatement.setString(13, ranStr());
// System.out.println(13);
preparedStatement.setString(14, ranStr());
// System.out.println(14);
preparedStatement.setString(15, ranStr());
// System.out.println(15);
preparedStatement.setString(16, ranStr());
// System.out.println(16);
preparedStatement.setString(17, ranStr());
// System.out.println(17);
preparedStatement.setString(18, ranStr());
// System.out.println(18);
preparedStatement.setString(19, ranStr());
// System.out.println(19);
preparedStatement.setInt(20, ranInt(1,10));
// System.out.println(20);
preparedStatement.setInt(21, ranInt(1,10));
// System.out.println(21);
preparedStatement.setInt(22, ranInt(1,10));
// System.out.println(22);
preparedStatement.setInt(23,ranInt(1,10));
// System.out.println(23);
preparedStatement.setInt(24, ranInt(1,10));
// System.out.println(24);
preparedStatement.setInt(25, ranInt(1,10));
// System.out.println(25);
preparedStatement.setString(26, ranStr());
// System.out.println(26);
preparedStatement.setTimestamp(27, ed);
// System.out.println(27);
preparedStatement.setTimestamp(28, et);
// System.out.println(28);
preparedStatement.setInt(29, ranInt(1,10));
// System.out.println(29);
preparedStatement.setInt(30, ranInt(1,10));
preparedStatement.setInt(31,ranInt(1,10));
// System.out.println(30);
// System.out.println(index);
preparedStatement.addBatch();
}
preparedStatement.executeBatch();
preparedStatement.clearBatch();
System.out.println("Counter = "+counter);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
connection.close();
} catch (SQLException e) {
e.printStackTrace();
}
}
}
private static String ranStr() {
String generatedString = RandomStringUtils.randomAlphabetic(5);
return generatedString;
}
private static int ranInt(int min, int max) {
Random r = new Random();
return r.ints(min, (max + 1)).limit(1).findFirst().getAsInt();
}
}
I have used a random function to generate the data. Thanks
06-19-2019
11:24 PM
Yes, I was unable to paste it directly because of the character limit, so I am sharing a Google Drive link:
https://drive.google.com/drive/folders/1AMF1SzRh4AuHcosfvFxXjSo0W0mdzKAN?usp=sharing
Please let me know if more information is required.
06-19-2019
11:04 PM
It has only 31 columns, with datatypes int, bigint, string, and timestamp. I created a batch of a mere 1k records using a PreparedStatement in Java, and I am using the latest JDBC 4.1 driver. The following query was used to create the table:
create TABLE IF NOT EXISTS tbl_mindarray (source_ip string,destination_ip string,protocol_number int,source_port int,destination_port int,packet bigint,volume bigint,duration bigint,pps bigint,bps bigint,bpp bigint,source_latitude string,source_longitude string,source_city string,source_country string,destination_latitude string,destination_longitude string,destination_city string,destination_country string,ingress_volume bigint,egress_volume bigint,ingress_packet bigint,egress_packet bigint,source_if_index int,destination_if_index int,source_host string,event_date timestamp,event_time timestamp,_time int,flow bigint) partitioned by (year int) stored as parquet;
06-19-2019
01:54 AM
Hi,
I am using the latest Cloudera VM as of 17 June 2019, with Impala version 2.10.0. I observed that INSERT INTO is taking too much time for a table that consists of 31 columns; most of the time is spent waiting on planning.
I am posting my query profile:
Query Info
Query ID: b746ed90a692cef9:d7c51dd400000000
User:
Database: default
Coordinator: quickstart.cloudera
Query Type: DML
Query State: FINISHED
Start Time: Jun 19, 2019 1:07:23 PM
End Time: Jun 19, 2019 1:08:07 PM
Duration: 43.8s
Rows Produced: 1,000
Admission Result: Admitted immediately
Admission Wait Time: 0ms
Aggregate Peak Memory Usage: 389.0 KiB
Client Fetch Wait Time: 25ms
Client Fetch Wait Time Percentage: 0
Estimated per Node Peak Memory: 10.0 MiB
File Formats:
HDFS Bytes Written: 160.0 KiB
Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
Memory Accrual: 199,168 byte seconds
Network Address: 172.16.10.72:49084
Node with Peak Memory Usage: quickstart.cloudera:22000
Out of Memory: false
Per Node Peak Memory Usage: 389.0 KiB
Planning Wait Time: 42.45s
Planning Wait Time Percentage: 97
Pool: root.default
Query Status: OK
Rows Inserted: 1,000
Session ID: 794f64ac9b4b1553:14542a16b4511ba2
Session Type: HIVESERVER2
Statistics Corrupt: false
Statistics Missing: false
Threads: CPU Time: 108ms
Threads: CPU Time Percentage: 100
Threads: Network Receive Wait Time: 0ms
Threads: Network Receive Wait Time Percentage: 0
Threads: Network Send Wait Time: 0ms
Threads: Network Send Wait Time Percentage: 0
Threads: Storage Wait Time: 0ms
Threads: Storage Wait Time Percentage: 0
Threads: Total Time: 108ms
Query Timeline
Query submitted: 86.45us (86.45us)
Planning finished: 42.45s (42.45s)
Submit for admission: 42.47s (21ms)
Completed admission: 42.47s (135.69us)
Ready to start on 1 backends: 42.48s (1ms)
All 1 execution backends (1 fragment instances) started: 42.55s (74ms)
DML data written: 43.37s (822ms)
DML Metastore update finished: 43.77s (395ms)
Request finished: 43.79s (20ms)
Unregister query: 43.81s (25ms)
Planner Timeline
Analysis finished: 41.71s (41.71s)
Equivalence classes computed: 41.72s (6ms)
Single node plan created: 41.87s (155ms)
Runtime filters computed: 41.87s (107.52us)
Distributed plan created: 41.88s (8ms)
Lineage info computed: 42.09s (214ms)
Planning finished: 42.27s (179ms)
What could be the reason, and how can I solve this?
Thanks
- Tags:
- impala
06-18-2019
12:27 AM
I wrote Java code to insert a few thousand records by creating a batch. I was not getting good performance from PreparedStatements, so I switched to Statements to see if there is a performance difference. However, after getting an error and reading the Impala docs, I learned that executeBatch() is not supported for Statements, yet I am still able to call addBatch(), which is kind of strange. Here I am posting some of my code:
private static Connection connectViaDS() throws Exception {
Connection connection = null;
Class.forName("com.cloudera.impala.jdbc41.DataSource");
connection = DriverManager.getConnection(CONNECTION_URL);
return connection;
}
//inside main--
// I have declared all the variables,I have lots of other data as well so it may create more confusion
try {
connection = connectViaDS();
connection.setAutoCommit(false);
Statement statement = connection.createStatement();
Timestamp ed = new Timestamp(e_date);
Timestamp et = new Timestamp(e_time);
while(counter <total) {
for (int index = 1; index <= records; index++) {
counter++;
String stmt_query = "INSERT INTO tbl_mindarray (source_ip,destination_ip,protocol_number" + index +"," +
"source_port" + index + ",destination_port" + index + ",packet,volume" + index + ",duration" + index + "," +
"pps" + index + ",bps" + index + ",bpp" + index + ",source_latitude,source_longitude," +
"source_city,source_country,destination_latitude,destination_longitude ,destination_city ,destination_country ," +
"ingress_volume" + index + " ,egress_volume " + index + ",ingress_packet ,egress_packet " + index + ",source_if_index ,destination_if_index," +
"source_host,event_date,event_time,_time,flow,year)" + "VALUES ('s_ip','d_ip',234,23,56,556,34,27,46,389,76," +
"'123123d','456456c','Damouli','Nepal','234234d','678678c','Mumbai','India',123,456,786,324,544,23,'192.168.1.44'," +
"345" + index +",345" + index +",2010" + index +")";
statement.addBatch(stmt_query);
}
statement.executeBatch();
statement.clearBatch();
//connection.commit();
System.out.println("Counter = "+counter);
//flag++;
}
My question is: how do I execute the batch if executeBatch() is not supported? addBatch() is supported, so there must be some way to execute it. Here is the error which I am getting:
java.sql.SQLException: Error message not found: NOT_IMPLEMENTED. Can't find resource for bundle java.util.PropertyResourceBundle, key NOT_IMPLEMENTED
at com.cloudera.impala.hivecommon.dataengine.HiveJDBCDataEngine.prepareBatch(Unknown Source)
at com.cloudera.impala.jdbc.common.SStatement.executeAnyBatch(Unknown Source)
at com.cloudera.impala.jdbc.common.SStatement.executeBatch(Unknown Source)
at impala_crt.writeInABatchWithCompiledQuery(impala_crt.java:278)
Caused by: com.cloudera.impala.support.exceptions.GeneralException: Error message not found: NOT_IMPLEMENTED. Can't find resource for bundle java.util.PropertyResourceBundle, key NOT_IMPLEMENTED
... 4 more
Somewhere I read that each prepared statement creates a file in HDFS, so I believe a plain Statement should be faster, as I can insert all row values using just one statement. It would be really helpful if someone could share their knowledge on this. Thanks
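Since the driver reports batch execution as NOT_IMPLEMENTED for plain Statements, one alternative sketch is to build a single multi-row INSERT ... VALUES statement and run it with executeUpdate(); Impala accepts multiple row constructors in one VALUES clause. The method below is illustrative only, with a reduced column list and made-up values, and assumes the existing java.sql.* import:
private static void insertAsSingleMultiRowStatement(Connection connection, int records) throws SQLException {
    // Build one multi-row INSERT instead of a JDBC batch.
    StringBuilder sql = new StringBuilder(
            "INSERT INTO tbl_mindarray (source_ip, destination_ip, year) VALUES ");
    for (int i = 1; i <= records; i++) {
        if (i > 1) {
            sql.append(", ");
        }
        // Illustrative literals; real code would format/escape each value properly.
        sql.append("('s_ip").append(i).append("', 'd_ip").append(i).append("', 2010)");
    }
    try (Statement statement = connection.createStatement()) {
        statement.executeUpdate(sql.toString()); // one round trip for the whole batch
    }
}
Each INSERT statement still produces new data files on HDFS, so for genuinely bulk loads the LOAD DATA / INSERT ... SELECT route discussed elsewhere in this thread scales better.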
Labels:
- Apache Impala
- Cloudera Manager
06-17-2019
11:37 PM
Yes, I had seen the document before posting here. I had a batch size of 10,000, which took several minutes to process, so I kept reducing the batch until I was able to insert the records in about one second; the batch size came down to 100. My question is: what changes do I have to make to get an insertion speed of 10k records/sec? I can increase the batch size to 10,000, but I won't get the speed. What are the bare minimum settings required to reach that speed? This isn't fast, as I have inserted up to 40k records/sec into MySQL from the same system, but Impala evidently works differently, and somewhere my lack of understanding is part of the problem. I cannot use INSERT ... SELECT or LOAD, as I will be getting live data into variables, so I need to use INSERT INTO. Is my Java method of using PreparedStatements correct, or is there a better way?
06-17-2019
01:46 AM
Hi, I am trying to insert a mere 10k-record batch, however it's taking a very long time, so I reduced my batch to 1000 records and it still took nearly 14 seconds. I don't know what's wrong; below is the Java code I tried. Currently I am able to insert about 200 records/sec, but I need at least 10k records/sec.
import java.sql.*;
class impala12 {
private static final String CONNECTION_URL = "jdbc:impala://172.16.8.177:21050" ;
private static final String sqlStatementCreate = "CREATE TABLE if not exists helloworld (message String) STORED AS PARQUET";
private static final String sqlStatementInsert = "INSERT INTO helloworld VALUES (\"helloworld\")";
private static final String sqlCompiledQuery = "INSERT INTO tbl_mindarray (source_ip,destination_ip,protocol_number," +
"source_port,destination_port,packet,volume,duration,pps,bps,bpp,source_latitude,source_longitude," +
"source_city,source_country,destination_latitude,destination_longitude ,destination_city ,destination_country ," +
"ingress_volume ,egress_volume ,ingress_packet ,egress_packet ,source_if_index ,destination_if_index," +
"source_host,event_date,event_time,_time,flow,year)" + " VALUES" +
"(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
;
//,event_date,event_time
//,?,?,?,?,?,?,?
//source_longitude,source_city," +
// "source_country,destination_latitude,destination_longitude ,destination_city ,destination_country
public static void main(String[] args) {
System.out.println("Impala using Java");
writeInABatchWithCompiledQuery(100);
System.out.println("Done");
}
private static Connection connectViaDS() throws Exception {
Connection connection = null;
Class.forName("com.cloudera.impala.jdbc41.Driver");
connection = DriverManager.getConnection(CONNECTION_URL);
return connection;
}
private static void writeInABatchWithCompiledQuery(int records) {
int protocol_no = 233,s_port=20,d_port=34,packet=46,volume=58,duration=39,pps=76,
bps=65,bpp=89,i_vol=465,e_vol=345,i_pkt=5,e_pkt=54,s_i_ix=654,d_i_ix=444,_time=1000,flow=989;
int yr = 1951;
int flag = 0;
String s_city = "Mumbai",s_country = "India", s_latt = "12.165.34c", s_long = "39.56.32d",
s_host="motadata",d_latt="29.25.43c",d_long="49.15.26c",d_city="Damouli",d_country="Nepal";
long e_date= 1275822966, e_time= 1370517366;
PreparedStatement preparedStatement;
// int total = 1000000*1000;
int total = 1000*1000;
int counter =0;
Connection connection = null;
try {
connection = connectViaDS();
connection.setAutoCommit(false);
preparedStatement = connection.prepareStatement(sqlCompiledQuery);
Timestamp ed = new Timestamp(e_date);
Timestamp et = new Timestamp(e_time);
while(counter <total) {
for (int index = 1; index <= records; index++) {
counter++;
preparedStatement.setString(1, "s_ip" + String.valueOf(index));
// System.out.println(1);
preparedStatement.setString(2, "d_ip" + String.valueOf(index));
// System.out.println(2);
preparedStatement.setInt(3, protocol_no + index);
// System.out.println(3);
preparedStatement.setInt(4, s_port + index);
// System.out.println(4);
preparedStatement.setInt(5, d_port + index);
// System.out.println(5);
preparedStatement.setInt(6, packet + index);
// System.out.println(6);
preparedStatement.setInt(7, volume + index);
// System.out.println(7);
preparedStatement.setInt(8, duration + index);
// System.out.println(8);
preparedStatement.setInt(9, pps + index);
// System.out.println(9);
preparedStatement.setInt(10, bps + index);
// System.out.println(10);
preparedStatement.setInt(11, bpp + index);
// System.out.println(11);
preparedStatement.setString(12, s_latt + String.valueOf(index));
// System.out.println(12);
preparedStatement.setString(13, s_long + String.valueOf(index));
// System.out.println(13);
preparedStatement.setString(14, s_city + String.valueOf(index));
// System.out.println(14);
preparedStatement.setString(15, s_country + String.valueOf(index));
// System.out.println(15);
preparedStatement.setString(16, d_latt + String.valueOf(index));
// System.out.println(16);
preparedStatement.setString(17, d_long + String.valueOf(index));
// System.out.println(17);
preparedStatement.setString(18, d_city + String.valueOf(index));
// System.out.println(18);
preparedStatement.setString(19, d_country + String.valueOf(index));
// System.out.println(19);
preparedStatement.setInt(20, i_vol + index);
// System.out.println(20);
preparedStatement.setInt(21, e_vol + index);
// System.out.println(21);
preparedStatement.setInt(22, i_pkt + index);
// System.out.println(22);
preparedStatement.setInt(23, e_pkt + index);
// System.out.println(23);
preparedStatement.setInt(24, s_i_ix + index);
// System.out.println(24);
preparedStatement.setInt(25, d_i_ix + index);
// System.out.println(25);
preparedStatement.setString(26, s_host + String.valueOf(index));
// System.out.println(26);
preparedStatement.setTimestamp(27, ed);
// System.out.println(27);
preparedStatement.setTimestamp(28, et);
// System.out.println(28);
preparedStatement.setInt(29, _time);
// System.out.println(29);
preparedStatement.setInt(30, flow + index);
preparedStatement.setInt(31, yr + flag);
// System.out.println(30);
// System.out.println(index);
preparedStatement.addBatch();
}
preparedStatement.executeBatch();
preparedStatement.clearBatch();
//connection.commit();
System.out.println("Counter = "+counter);
//flag++;
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
connection.close();
} catch (SQLException e) {
e.printStackTrace();
}
}
}
}
I am using the latest Cloudera VM provided as of 17 June 2019. Statements are not working; I am getting an error, as executeBatch() is not supported. I tried increasing the daemon memory limit and heap, but it didn't help. I had posted this earlier, but it was reported as spam or something and got deleted, so I am reposting it. Thanks
- Tags:
- impala java
Labels:
- Apache Impala
- Cloudera Manager