Created on 02-17-2017 09:23 PM
Centralized Cache Management in HDFS is a mechanism that explicitly caches specific files or directories in memory for improved performance. This is useful for relatively small files that are accessed repeatedly. For example, reference/lookup tables or fact tables that are used in many joins. Once enabled, HDFS will automatically cache selected files, and periodically check for changes and recache the files.
While HDFS and the underlying file system do some caching of files when memory is available, explicit caching with Centralized Cache Management prevents the data from being evicted from memory when other processes consume all of the physical memory. A corollary: if you are testing on a lightly loaded system with plenty of free memory, you may not see any performance improvement from this method, because the data was already in the OS disk cache. Your performance testing therefore needs to stress the system.
Let’s look at some key terms and concepts:
A cache pool is an administrative entity used to manage groups of cache directives. One of the key attributes of a pool is the maximum number of bytes that can be cached across all directives in the pool.
Cache pools can be managed from the command line using the hdfs cacheadmin utility. Some common commands include:
hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]
hdfs cacheadmin -removePool <name>
hdfs cacheadmin -listPools [-stats] [<name>]
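For example, a pool that caps the total number of cached bytes might be created like this (the pool name and the 1 GB limit are illustrative, not from the walkthrough below):
hdfs cacheadmin -addPool reportingPool -mode 0755 -limit 1073741824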
A cache directive defines a path that should be cached. This can be either a specific file or a single directory. Note that directives are not recursive: they apply to a single directory only, not to any sub-directories. They would therefore usually be applied to the lowest-level directory that contains the actual data files.
Cache directives can be managed from the command line using the hdfs cacheadmin utility. Some common commands include:
hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
hdfs cacheadmin -removeDirective <id>
hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]
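As an illustration, a directive with a relative time-to-live might look like the following (the path, pool name, and 7-day TTL are assumptions for the example; TTLs can also be set to "never" as shown in the walkthrough below):
hdfs cacheadmin -addDirective -path /apps/hive/warehouse/lookup_tables -pool reportingPool -replication 2 -ttl 7d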
Only one Hadoop configuration setting is required to turn on Centralized Caching. There are a few others that control how frequently the cache scanner looks for new files, which you can usually leave at their defaults. The following setting, added to the custom hdfs-site.xml, specifies the maximum number of bytes that can be cached on each datanode.
dfs.datanode.max.locked.memory
Remember that this value is in bytes, in contrast with the OS memlock limits, which are set in KB.
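As a quick sanity check of the unit conversion (the 1 GB figure is just an example), the same limit is expressed two different ways:
# dfs.datanode.max.locked.memory (bytes): 1073741824
# memlock limit in limits.conf (KB):      1048576
echo $((1073741824 / 1024))   # prints 1048576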
Before you implement Centralized Caching, you need to ensure that the locked-memory (memlock) limit on each of the datanodes is set to a value equal to or greater than the value specified in dfs.datanode.max.locked.memory. On each datanode, run the following to determine the current locked-memory limit. It will return a value in KB or “unlimited”.
ulimit -l
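If you have many datanodes, a quick way to check them all is a simple loop over SSH (the host names here are placeholders for your own datanodes):
for host in datanode1 datanode2 datanode3; do
  echo -n "$host: "; ssh "$host" 'ulimit -l'
done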
To set this, add the following to /etc/security/limits.conf. This applies to CentOS/Red Hat and may differ on other Linux distributions. The setting takes effect when you log out of a terminal session and log back in.
* hard memlock 1048576
* soft memlock 1048576
Let’s walk through an example.
1. Set memlock limits on each datanode. This takes effect after you log out and log back in.
# On each datanode (max cacheable memory in KB), example for 1.0 GB
echo "* hard memlock 1048576" >> /etc/security/limits.conf
echo "* soft memlock 1048576" >> /etc/security/limits.conf
2. Create a folder to be cached
hadoop fs -mkdir -p /cache/usstates
hadoop fs -chmod -R 777 /cache
3. Create a Cache Pool
hdfs cacheadmin -addPool testPool -mode 0777 -maxTtl never
4. Create one or more Cache Directives
hdfs cacheadmin -addDirective -path /cache/usstates -pool testPool -replication 3 -ttl never
5. Change the HDFS configuration to support caching. Add the following to the HDFS configs (Custom hdfs-site.xml in Ambari). This example is for 0.5 GB (in bytes).
dfs.datanode.max.locked.memory=536870912
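If you manage hdfs-site.xml by hand rather than through Ambari, the equivalent property block would look something like this (value shown for 0.5 GB, matching the example above):
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>536870912</value>
</property>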
6. Restart HDFS using Ambari
7. Get some test data and load into cached directory
wget http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv
# Strip the column headers and double quotes
tail -n +2 states.csv > usstates.csv
sed -i 's/\"//g' usstates.csv
hadoop fs -put -f usstates.csv /cache/usstates
8. Look for cached data. You should see non-zero values in BYTES_CACHED and FILES_CACHED.
hdfs cacheadmin -listDirectives -stats
Found 1 entry
 ID  POOL      REPL  EXPIRY  PATH             BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED
  1  testPool     3  never   /cache/usstates          2550          2550             1             1
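You can also confirm usage at the pool level with the -listPools command shown earlier; the -stats flag adds cache statistics columns such as BYTES_CACHED for the pool:
hdfs cacheadmin -listPools -stats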
9. Query the data with Hive (Optional)
CREATE EXTERNAL TABLE states (state_name string, state_cd string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/cache/usstates';

select * from states;
After that, you can update the file (add some dummy states), re-upload it to Hadoop, and verify that the changes are picked up (see the sketch below), add additional folders and files, and generally experiment. You can also performance test on your particular system, remembering that you may not see much difference unless memory pressure forces the normal OS disk cache to be evicted.
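A minimal way to verify re-caching, reusing the file from step 7 (the dummy state row is just an example), is:
echo "Newstate,NS" >> usstates.csv
hadoop fs -put -f usstates.csv /cache/usstates
# After the next cache rescan, BYTES_CACHED should reflect the larger file
hdfs cacheadmin -listDirectives -stats -path /cache/usstates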
References:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_hdfs-administration/content/ch03.html
Created on 02-19-2017 09:31 PM
Do I need to use "*" to increase memlock?
Could I use "hdfs" instead?
Created on 02-23-2017 04:46 PM
Yes. If you want to be more restrictive, you could use the user hdfs, or @hadoop to indicate any user in the hadoop group.
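For example, the /etc/security/limits.conf entries scoped to the hdfs user (rather than *) would look like this, with the value in KB as before:
hdfs hard memlock 1048576
hdfs soft memlock 1048576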