Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

What performance to expect from Cloudera VM ?

avatar
Explorer

Hello,

After many days of POC and some tuning I am getting some better results however the performance seems less then what I was expecting. I understand the need of better hardware as Impala is meant for bigdata however I also need to know its performance on some regular systems.

 

For this I created a parquet partitioned table and filled it with 500million data, I feel query is taking twice the time it should take.

I had set PARQUET_FILE_SIZE = 512MB

 

Details of my table -

tbl_parq_512m

 

+------------------------------+-------------------------------------------------------------------+----------------------+
| name                         | type                                                              | comment              |
+------------------------------+-------------------------------------------------------------------+----------------------+
| # col_name                   | data_type                                                         | comment              |
|                              | NULL                                                              | NULL                 |
| protocol_number              | int                                                               | NULL                 |
| source_port                  | int                                                               | NULL                 |
| destination_port             | int                                                               | NULL                 |
| packet                       | int                                                               | NULL                 |
| volume                       | int                                                               | NULL                 |
| duration                     | int                                                               | NULL                 |
| pps                          | int                                                               | NULL                 |
| bps                          | int                                                               | NULL                 |
| bpp                          | int                                                               | NULL                 |
| source_latitude              | string                                                            | NULL                 |
| source_longitude             | string                                                            | NULL                 |
| source_city                  | string                                                            | NULL                 |
| source_country               | string                                                            | NULL                 |
| destination_latitude         | string                                                            | NULL                 |
| destination_longitude        | string                                                            | NULL                 |
| destination_city             | string                                                            | NULL                 |
| destination_country          | string                                                            | NULL                 |
| ingress_volume               | int                                                               | NULL                 |
| egress_volume                | int                                                               | NULL                 |
| ingress_packet               | int                                                               | NULL                 |
| egress_packet                | int                                                               | NULL                 |
| source_if_index              | int                                                               | NULL                 |
| destination_if_index         | int                                                               | NULL                 |
| source_host                  | string                                                            | NULL                 |
| flow                         | int                                                               | NULL                 |
|                              | NULL                                                              | NULL                 |
| # Partition Information      | NULL                                                              | NULL                 |
| # col_name                   | data_type                                                         | comment              |
|                              | NULL                                                              | NULL                 |
| source_ip                    | string                                                            | NULL                 |
| destination_ip               | string                                                            | NULL                 |
| year                         | int                                                               | NULL                 |
| event_date                   | string                                                            | NULL                 |
|                              | NULL                                                              | NULL                 |
| # Detailed Table Information | NULL                                                              | NULL                 |
| Database:                    | default                                                           | NULL                 |
| Owner:                       | root                                                              | NULL                 |
| CreateTime:                  | Fri Jun 28 17:08:24 IST 2019                                      | NULL                 |
| LastAccessTime:              | UNKNOWN                                                           | NULL                 |
| Protect Mode:                | None                                                              | NULL                 |
| Retention:                   | 0                                                                 | NULL                 |
| Location:                    | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m | NULL                 |
| Table Type:                  | MANAGED_TABLE                                                     | NULL                 |
| Table Parameters:            | NULL                                                              | NULL                 |
|                              | STATS_GENERATED_VIA_STATS_TASK                                    | true                 |
|                              | transient_lastDdlTime                                             | 1561734077           |
|                              | NULL                                                              | NULL                 |
| # Storage Information        | NULL                                                              | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe       | NULL                 |
| InputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat     | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat    | NULL                 |
| Compressed:                  | No                                                                | NULL                 |
| Num Buckets:                 | 0                                                                 | NULL                 |
| Bucket Columns:              | []                                                                | NULL                 |
| Sort Columns:                | []                                                                | NULL                 |
+------------------------------+-------------------------------------------------------------------+----------------------+

 

My data using table stats-

 

[quickstart.cloudera:21000] > show table stats tbl_parq_512m;
Query: show table stats tbl_parq_512m
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| source_ip | destination_ip | year | event_date | #Rows | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                                                                                       |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 10.0.0.1  | 20.0.0.1       | 1990 | 1/1/1990   | -1    | 1      | 337.76MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 |
| 30.0.0.1  | 40.0.0.1       | 1993 | 1/1/1993   | -1    | 3      | 1.32GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 |
| 41.0.0.1  | 45.0.0.1       | 1994 | 1/1/1994   | -1    | 3      | 1.32GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 |
| 46.0.0.1  | 50.0.0.1       | 1995 | 1/1/1995   | -1    | 7      | 3.23GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 |
| 51.0.0.1  | 55.0.0.1       | 1996 | 1/1/1996   | -1    | 7      | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 |
| 56.0.0.1  | 60.0.0.1       | 1997 | 1/1/1997   | -1    | 7      | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 |
| 61.0.0.1  | 70.0.0.1       | 1998 | 1/1/1998   | -1    | 14     | 6.59GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 |
| 71.0.0.1  | 75.0.0.1       | 1999 | 1/1/1999   | -1    | 7      | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 |
| 80.0.0.1  | 85.0.0.1       | 2000 | 1/1/2000   | -1    | 7      | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 |
| 86.0.0.1  | 90.0.0.1       | 2001 | 1/1/2001   | -1    | 7      | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 |
| 91.0.0.1  | 95.0.0.1       | 2002 | 1/1/2002   | -1    | 11     | 5.47GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 |
| Total     |                |      |            | -1    | 74     | 34.74GB  | 0B           |                   |         |                   |                                                                                                                                                |
+-----------+----------------+------+------------+-------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+

 

Block size details -

 

[root@quickstart home]# hdfs fsck -blocks /user/hive/warehouse/tbl_parq_512m;
Connecting to namenode via http://quickstart.cloudera:50070/fsck?ugi=root&blocks=1&path=%2Fuser%2Fhive%2Fwarehouse%2Ftbl_parq_512m
FSCK started by root (auth:SIMPLE) from /172.16.8.177 for path /user/hive/warehouse/tbl_parq_512m at Tue Jul 02 00:47:19 IST 2019
..........................................................................Status: HEALTHY
 Total size:	37305601105 B
 Total dirs:	46
 Total files:	74
 Total symlinks:		0
 Total blocks (validated):	74 (avg. block size 504129744 B)
 Minimally replicated blocks:	74 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	1
 Average block replication:	1.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		1
 Number of racks:		1
FSCK ended at Tue Jul 02 00:47:19 IST 2019 in 15 milliseconds


The filesystem under path '/user/hive/warehouse/tbl_parq_512m' is HEALTHY

 

My query and result-

 

[quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10;
Query: select source_ip,destination_ip,avg(volume) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10
Query submitted at: 2019-07-01 18:24:10 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=fd40c639c5e21365:3cbcfc1400000000
+-----------+----------------+-------------------+
| source_ip | destination_ip | avl               |
+-----------+----------------+-------------------+
| 41.0.0.1  | 45.0.0.1       | 4.498248032480324 |
| 46.0.0.1  | 50.0.0.1       | 4.499559623845538 |
| 30.0.0.1  | 40.0.0.1       | 4.499744347443475 |
| 91.0.0.1  | 95.0.0.1       | 4.499926612990091 |
| 80.0.0.1  | 85.0.0.1       | 4.499946078921578 |
| 10.0.0.1  | 20.0.0.1       | 4.499997599976    |
| 51.0.0.1  | 55.0.0.1       | 4.500039240784815 |
| 61.0.0.1  | 70.0.0.1       | 4.500189103782076 |
| 86.0.0.1  | 90.0.0.1       | 4.500191183823676 |
| 71.0.0.1  | 75.0.0.1       | 4.500268505370108 |
+-----------+----------------+-------------------+
Fetched 10 row(s) in 34.70s

As we can see its takes on average 34 seconds however I was expecting it to be quicker then this.

 

I am using the latest VM shipped and Cloudera Express.

 

I have increased my RAM to 10gb and harddisk to 250gb.

 

What more tunning can be done to improve performance, Do I need to tune CentOS ?

 

Thanks

 

 

 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

HI @punshi 

Do you use the QuickStart VM for CDH 5.13?
Indeed, the VM is just for testing !! To unlock the real power of the impala (CDH), you should have a cluster so that you can benefit from the senergy of several nodes.

Anyway, you can improve your query time by:
1- Set PARQUET_FILE_SIZE = 256MB instead of 512MB.
2- Try to minimize the number of partitions (in fact, I think in your case, the year is sufficient).


NB: I think that in a cluster that has more than 10 nodes, this request will not exceed 2 to 4 seconds.

Good luck.

View solution in original post

2 REPLIES 2

avatar
Master Collaborator

HI @punshi 

Do you use the QuickStart VM for CDH 5.13?
Indeed, the VM is just for testing !! To unlock the real power of the impala (CDH), you should have a cluster so that you can benefit from the senergy of several nodes.

Anyway, you can improve your query time by:
1- Set PARQUET_FILE_SIZE = 256MB instead of 512MB.
2- Try to minimize the number of partitions (in fact, I think in your case, the year is sufficient).


NB: I think that in a cluster that has more than 10 nodes, this request will not exceed 2 to 4 seconds.

Good luck.

avatar
Explorer

Thanks @AcharkiMed 

 

I tried that there was no improvement however after enabling hyper threading I was able to reduce it to 25sec from 40. I tried my hands on HDFS cache however even after defining cache_pool size to 3gb only 1 gb data gets cached, Any idea ?

 

Query: show table stats tbl_parq_123
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| year  | #Rows | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                   |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| 1990  | -1    | 2      | 338.45MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 |
| 1993  | -1    | 6      | 1.32GB   | 0B           | 1                 | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 |
| 1994  | -1    | 6      | 1.32GB   | 1010.95MB    | 1                 | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 |
| 1995  | -1    | 14     | 3.24GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 |
| 1996  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 |
| 1997  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 |
| 1998  | -1    | 27     | 6.60GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 |
| 1999  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 |
| 2000  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 |
| 2001  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 |
| 2002  | -1    | 23     | 5.48GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 |
| Total | -1    | 148    | 34.79GB  | 1010.95MB    |                   |         |                   |                                                                            |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
F

 

[root@quickstart ~]# hdfs cacheadmin -listPools
Found 1 result.
NAME            OWNER   GROUP  MODE             LIMIT  MAXTTL
three_gig_pool  impala  hdfs   rwxr-xr-x   3000000000   never


Thanks