Hello,
I am trying to use the HDFS cache to improve performance, but I am unable to use the full cache pool I defined.
I tried a REFRESH statement on the table, and I tried creating a new table with caching defined from the start, but neither worked.
I also removed the cache setting and assigned it again, with no difference.
My table stats-
Query: show table stats tbl_parq_123
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| year  | #Rows | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                   |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
| 1990  | -1    | 2      | 338.45MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 |
| 1993  | -1    | 6      | 1.32GB   | 0B           | 1                 | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 |
| 1994  | -1    | 6      | 1.32GB   | 1010.95MB    | 1                 | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 |
| 1995  | -1    | 14     | 3.24GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 |
| 1996  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 |
| 1997  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 |
| 1998  | -1    | 27     | 6.60GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 |
| 1999  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 |
| 2000  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 |
| 2001  | -1    | 14     | 3.30GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 |
| 2002  | -1    | 23     | 5.48GB   | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 |
| Total | -1    | 148    | 34.79GB  | 1010.95MB    |                   |         |                   |                                                                            |
+-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+
Pool size-
[root@quickstart ~]# hdfs cacheadmin -listPools
Found 1 result.
NAME            OWNER   GROUP  MODE       LIMIT       MAXTTL
three_gig_pool  impala  hdfs   rwxr-xr-x  3000000000  never
My method-
[quickstart.cloudera:21000] > alter table tbl_parq_123 set cached in 'three_gig_pool';
Query: alter table tbl_parq_123 set cached in 'three_gig_pool'
+---------------+
| summary       |
+---------------+
| Cached table. |
+---------------+
Fetched 1 row(s) in 1.98s
Sometimes the cached data is 500 MB or 800 MB, but it never exceeds 1 GB. Is there a parameter or setting I need to check?
Thanks
Created 07-03-2019 11:03 AM
Hi @punshi
How much cache space have you configured? Please try this HDFS command to display how much cache is configured and used:
hdfs dfsadmin -report
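The report is long, so here is a rough sketch of narrowing it down to the cache figures, assuming a POSIX shell. `cache_lines` is a hypothetical helper name, not part of HDFS, and running the report as the `hdfs` user is an assumption about how the cluster is set up.

```shell
# Hypothetical helper: keep only the cache-related lines of a
# `hdfs dfsadmin -report` dump that is piped in on stdin.
cache_lines() {
  grep 'Cache'
}

# On a live cluster (assumption: the hdfs user is the HDFS superuser):
#   sudo -u hdfs hdfs dfsadmin -report | cache_lines
```

The lines to look at are "Configured Cache Capacity", "Cache Used" and "Cache Remaining" for each DataNode.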
Created on 07-03-2019 11:55 PM - edited 07-03-2019 11:56 PM
Thanks @AcharkiMed
That command gives a clear picture now, but I still don't know why the value is set to 1 GB.
Configured Cache Capacity: 1 GB
[root@quickstart ~]# sudo -u hdfs hdfs dfsadmin -report
Configured Capacity: 250717949952 (233.50 GB)
Present Capacity: 208179929088 (193.88 GB)
DFS Remaining: 88135348224 (82.08 GB)
DFS Used: 120044580864 (111.80 GB)
DFS Used%: 57.66%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 172.16.8.177:50010 (quickstart.cloudera)
Hostname: quickstart.cloudera
Rack: /default
Decommission Status : Normal
Configured Capacity: 250717949952 (233.50 GB)
DFS Used: 120044580864 (111.80 GB)
Non DFS Used: 29259272192 (27.25 GB)
DFS Remaining: 88135348224 (82.08 GB)
DFS Used%: 47.88%
DFS Remaining%: 35.15%
Configured Cache Capacity: 1073741824 (1 GB)
Cache Used: 1060061184 (1010.95 MB)
Cache Remaining: 13680640 (13.05 MB)
Cache Used%: 98.73%
Cache Remaining%: 1.27%
Xceivers: 2
Last contact: Thu Jul 04 12:04:21 IST 2019
I checked hdfs-default.xml for a parameter that defines this value but couldn't find one.
I saw one parameter, dfs.datanode.max.locked.memory=0, but I think it is something different.
Is the value set automatically based on RAM, or can I configure it?
Thanks
Created 07-04-2019 02:16 AM
Hi @punshi
Yes you can change it by editing this parameter:
dfs.datanode.max.locked.memory
But you need to know that HDFS caching moves the cached data from HDFS storage into the DataNode's memory (RAM), so you cannot increase it considerably!
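One thing worth checking before raising the value: dfs.datanode.max.locked.memory is given in bytes, while the locked-memory limit of the DataNode user (`ulimit -l`) is reported in KB, and the DataNode will not start if the configured value is larger than that limit. A minimal sketch of the comparison follows; `fits_memlock` is a hypothetical helper, and the sample numbers are only illustrations.

```shell
# Hypothetical helper: does a cache size in BYTES fit within a
# memlock limit given in KB (the unit `ulimit -l` prints)?
# Usage: fits_memlock CACHE_BYTES MEMLOCK_KB
fits_memlock() {
  cache_bytes=$1
  memlock_kb=$2
  if [ "$memlock_kb" = "unlimited" ]; then
    echo "ok"
    return 0
  fi
  # Convert KB to bytes before comparing.
  if [ $((memlock_kb * 1024)) -ge "$cache_bytes" ]; then
    echo "ok"
  else
    echo "too small"
  fi
}

fits_memlock 3000000000 1048576     # 1 GB memlock limit: prints "too small"
fits_memlock 3000000000 unlimited   # prints "ok"

# On a real node (assumption: run as the user that starts the DataNode):
#   hdfs getconf -confKey dfs.datanode.max.locked.memory
#   ulimit -l
```

Note that `hdfs getconf` reads the configuration visible to the client, which may not be the one the DataNode was started with, so treat its output as a hint rather than proof.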
Created 07-04-2019 05:40 AM
Thanks @AcharkiMed
I am still wondering: if I give the cache pool 3 GB using hdfs cacheadmin, why does it allocate only 1 GB? Is it related to RAM size? I had assumed the HDFS cache used space on my hard disk, not RAM.
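The figures already posted in this thread can be put side by side. This is a minimal sketch, assuming a single DataNode as in the report above, and assuming the cached bytes are bounded by whichever is smaller: the pool limit or the DataNode's configured cache capacity. `min_bytes` is a hypothetical helper.

```shell
# Hypothetical helper: print the smaller of two byte counts.
min_bytes() {
  if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

pool_limit=3000000000          # LIMIT from `hdfs cacheadmin -listPools`
dn_cache_capacity=1073741824   # "Configured Cache Capacity" from the report
cache_used=1060061184          # "Cache Used" from the report

effective=$(min_bytes "$pool_limit" "$dn_cache_capacity")
echo "effective cap: $effective bytes"
echo "remaining:     $((effective - cache_used)) bytes"
```

Under those assumptions the cap is 1073741824 bytes (1 GB) and 13680640 bytes are left, which matches the "Cache Remaining" line in the report. So the 3 GB pool limit is never reached because the DataNode's own limit is hit first.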
Created 07-04-2019 07:29 AM
Hi @punshi
Read this for more information about HDFS caching in Impala:
https://www.cloudera.com/documentation/enterprise/5-16-x/topics/impala_perf_hdfs_caching.html