Created on 02-17-2017 09:23 PM
Centralized Cache Management in HDFS is a mechanism that explicitly caches specific files or directories in memory for improved performance. This is useful for relatively small files that are accessed repeatedly. For example, reference/lookup tables or fact tables that are used in many joins. Once enabled, HDFS will automatically cache selected files, and periodically check for changes and recache the files.
While HDFS and the underlying file system do some caching of files when memory is available, explicit caching with Centralized Cache Management prevents the data from being evicted from memory when other processes consume all of the physical memory. A corollary: if you are testing on a lightly loaded system with plenty of free memory, you may not see any performance improvement from this method, because the data was already in the OS disk cache. Your performance testing therefore needs to stress the system.
Let’s look at some key terms and concepts:
A cache pool is an administrative entity used to manage groups of cache directives. One of the key attributes of a pool is the maximum number of bytes that can be cached across all directives in the pool.
Cache pools can be managed from the command line using the hdfs cacheadmin utility. Some common commands include:
hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]
hdfs cacheadmin -removePool <name>
hdfs cacheadmin -listPools [-stats] [<name>]
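For example, a pool that caps the total number of cached bytes might be created like this (the pool name and the 1 GB limit are illustrative, not from the walkthrough below):
hdfs cacheadmin -addPool reportingPool -mode 0755 -limit 1073741824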
A cache directive defines a path that should be cached. This can be either a specific file or a single directory. Note that directives are not recursive: they apply to a single directory only, not to any sub-directories. They would therefore usually be applied to the lowest-level directory that contains the actual data files.
Cache directives can be managed from the command line using the hdfs cacheadmin utility. Some common commands include:
hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]
hdfs cacheadmin -removeDirective <id>
hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]
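As an illustration, a directive with a relative time-to-live might look like the following (the path, pool name, and 7-day TTL are assumptions for the example; TTLs can also be set to "never" as shown in the walkthrough below):
hdfs cacheadmin -addDirective -path /apps/hive/warehouse/lookup_tables -pool reportingPool -replication 2 -ttl 7d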
Only one Hadoop configuration setting is required to turn on Centralized Caching. There are a few others that control how frequently the cache scanner looks for new files, which you can usually leave at their defaults. The following setting, added to the custom hdfs-site.xml, specifies the maximum number of bytes that can be cached on each datanode.
dfs.datanode.max.locked.memory
Remember that this value is in bytes, in contrast with the OS memlock limits, which are set in KB.
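As a quick sanity check of the unit conversion (the 1 GB figure is just an example), the same limit is expressed two different ways:
# dfs.datanode.max.locked.memory (bytes): 1073741824
# memlock limit in limits.conf (KB):      1048576
echo $((1073741824 / 1024))   # prints 1048576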
Before you implement Centralized Caching, you need to ensure that the locked-memory (memlock) limit on each of the datanodes is set to a value equal to or greater than the value specified in dfs.datanode.max.locked.memory. On each datanode, run the following to determine the current locked-memory limit. It will return a value in KB or “unlimited”.
ulimit -l
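If you have many datanodes, a quick way to check them all is a simple loop over SSH (the host names here are placeholders for your own datanodes):
for host in datanode1 datanode2 datanode3; do
  echo -n "$host: "; ssh "$host" 'ulimit -l'
done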
To set this, add the following to /etc/security/limits.conf. This applies to CentOS/Red Hat and may differ on other Linux distributions. The setting takes effect when you log out of a terminal session and log back in.
* hard memlock 1048576
* soft memlock 1048576
Let’s walk through an example.
1. Set memlock limits on each datanode. This takes effect after you log out and log back in.
# On each datanode (max cacheable memory in KB), example for 1.0 GB
echo "* hard memlock 1048576" >> /etc/security/limits.conf
echo "* soft memlock 1048576" >> /etc/security/limits.conf
2. Create a folder to be cached
hadoop fs -mkdir -p /cache/usstates
hadoop fs -chmod -R 777 /cache
3. Create a Cache Pool
hdfs cacheadmin -addPool testPool -mode 0777 -maxTtl never
4. Create one or more Cache Directives
hdfs cacheadmin -addDirective -path /cache/usstates -pool testPool -replication 3 -ttl never
5. Change the HDFS configuration to support caching. Add the following to the HDFS configs (Custom hdfs-site.xml in Ambari). This example is for 0.5 GB (in bytes).
dfs.datanode.max.locked.memory=536870912
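If you manage hdfs-site.xml by hand rather than through Ambari, the equivalent property block would look something like this (value shown for 0.5 GB, matching the example above):
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>536870912</value>
</property>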
6. Restart HDFS using Ambari
7. Get some test data and load into cached directory
wget http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv
# Strip the column headers and double quotes
tail -n +2 states.csv > usstates.csv
sed -i 's/\"//g' usstates.csv
hadoop fs -put -f usstates.csv /cache/usstates
8. Look for cached data. You should see non-zero values in BYTES_CACHED and FILES_CACHED.
hdfs cacheadmin -listDirectives -stats
Found 1 entry
 ID  POOL      REPL  EXPIRY  PATH             BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED
  1  testPool     3  never   /cache/usstates          2550          2550             1             1
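You can also confirm usage at the pool level with the -listPools command shown earlier; the -stats flag adds cache statistics columns such as BYTES_CACHED for the pool:
hdfs cacheadmin -listPools -stats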
9. Query the data with Hive (Optional)
CREATE EXTERNAL TABLE states (state_name string, state_cd string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/cache/usstates';

select * from states;
After that, you can update the file (add some dummy states), re-upload it to Hadoop, and verify that the changes are picked up (see the sketch below), add additional folders and files, and generally experiment. You can also performance test on your particular system, remembering that you may not see much difference unless memory pressure forces the normal OS disk cache to be evicted.
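A minimal way to verify re-caching, reusing the file from step 7 (the dummy state row is just an example), is:
echo "Newstate,NS" >> usstates.csv
hadoop fs -put -f usstates.csv /cache/usstates
# After the next cache rescan, BYTES_CACHED should reflect the larger file
hdfs cacheadmin -listDirectives -stats -path /cache/usstates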
References:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_hdfs-administration/content/ch03.html
Created on 02-19-2017 09:31 PM
Do I need to use "*" to increase memlock?
Could I use "hdfs" instead?
Created on 02-23-2017 04:46 PM
Yes. If you want to be more restrictive, you could use the user hdfs, or @hadoop to indicate any user in the hadoop group.
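For example, the /etc/security/limits.conf entries scoped to the hdfs user (rather than *) would look like this, with the value in KB as before:
hdfs hard memlock 1048576
hdfs soft memlock 1048576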