Introduction

In this previous article, we talked about the benefits of deploying COD on cloud storage, covering the challenges of maintaining the consistency of HBase data and the different attempts to solve that problem, with a detailed technical description of the Store File Tracking (SFT) feature we now use in COD. With SFT, we can successfully move our dataset to the cloud storage file system of preference, decoupling compute from storage and benefiting from optimal resource usage and independent scaling.

Moving the data away from the RegionServers, however, comes with a performance cost: the additional latency of accessing cloud storage. Our initial benchmarks showed that COD over cloud storage performed, on average, five times slower than COD clusters using HDFS on HDD-backed block storage (for a specific vendor).

HBase already ships with built-in caching capabilities, including support for file-based caching, and all major cloud providers offer instance types with ephemeral volumes that are cheaper than block storage disks. With that in mind, we defined a COD architecture combining these two features and found that it could achieve performance parity with COD on HDFS over block storage, but at a lower cost. Although promising, when we started to apply this solution to real, large-scale production workloads, we faced some challenges indicating that the existing HBase caching feature required a few improvements. In this article, we briefly review Apache HBase cache concepts and how caching fits into the COD on cloud storage architecture, then describe the problems we encountered and the solutions developed to mitigate them.

HBase Cache review

Caching has been around in HBase since its early releases, and there are two main types of cache: the Memstore and the Block cache.

The Memstore is an on-heap cache of all recently inserted data. Because it is part of the RegionServer JVM heap, space there is limited, and data in this cache is therefore short-lived: once the Memstore is full, its contents are flushed into HBase files (hfiles) in the filesystem.

The Block cache is where hfile contents get cached. It gets this name because hfiles are internally divided into smaller portions called "blocks", and these are placed individually in the cache. The Block cache has different implementations. Originally, it was always an on-heap cache similar to the Memstore, but an optional implementation was later introduced by HBASE-7404. This splits caching into two layers: the first layer (L1) always lives on-heap, while the second layer (L2) offers multiple, configurable storage options outside the heap. This was made the default for Cloudera distributions based on the HBase 2.0.0 release.

A block can carry user data, making it a DATA block, or index information pointing to other blocks in the file for efficient seeks, making it a META block. META blocks are always stored in the L1 cache, whilst DATA blocks are cached in the L2 cache. The L2 cache implementation is known as the BucketCache and supports the following storage options: off-heap (direct memory access), local file, memory-mapped file, and persistent memory.
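As an illustration, the BucketCache storage engine is selected through the hbase.bucketcache.ioengine property, with hbase.bucketcache.size defining its capacity. The fragment below is a minimal hbase-site.xml sketch of the file-based option; the mount point and the size (roughly the 1.6TB discussed later, expressed in MB) are example values, not prescriptions:

```xml
<!-- Minimal sketch: file-based BucketCache (L2). The path below is a
     hypothetical ephemeral-volume mount point; adjust to your deployment. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <!-- Other supported engines: offheap, mmap:PATH, pmem:PATH -->
  <value>file:/mnt/nvme0/hbase-bucketcache</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- Cache capacity in MB (values >= 1.0 are megabytes); ~1.6TB here -->
  <value>1677722</value>
</property>
```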
COD Caching over ephemeral storage

As mentioned above, the Block cache L2 implementation (BucketCache) allows caching data in a local file, and all major cloud providers offer instance types with NVMe SSD volumes attached. These volumes are ephemeral, meaning data isn't retained across instance restarts, but they are offered at a much lower cost than block storage of equivalent size. We can therefore use this ephemeral storage to accommodate the whole dataset (spread over multiple RegionServers) in a file-based BucketCache, keeping performance similar to deployments using block storage, but at a much lower cost. The fact that these volumes are ephemeral doesn't create a consistency problem, because all the data is backed up in the cloud storage. The diagram below illustrates this architecture.

Challenges

Cache Index Size

The BucketCache implementation used as the L2 cache keeps an in-memory index structure mapping all blocks cached on disk. This translates into individual Java objects on the RegionServer heap for each of those blocks. By default, HBase keeps the block size around 64KB. When experimenting with different cache and heap sizes, our latest benchmarks showed good results when deploying RegionServers with a 31GB heap and a 1.6TB file-based BucketCache on the ephemeral storage volume.

However, when testing this architecture against one of our customers' real use case data sets, we observed high heap usage and GC activity, affecting system performance. A heap dump analysis revealed that the BucketCache index objects dominated the heap allocation. We then checked the L2 metrics and realized that the number of blocks was disproportionately higher than what we had seen in our benchmarks, despite the cache size and usage being similar. This was an indication that the customer dataset had smaller blocks than the 64KB default, and we confirmed that suspicion by checking some of the files with the HFilePrettyPrinter tool. Files of the same size had many more blocks (compared to our own test dataset), and we could also see that blocks were getting compressed down to as little as 6KB (90% compression). That explained the different behavior from our tests: we were not using any compression or encoding in our test dataset.

Further analysis of the block writing code path gave us a clear picture: when writing blocks, HBase considered the raw data size, "closing" a block when that size reached the configured limit (64KB by default), and only then applied the compression/encoding logic and serialized the block. That led to the development work proposed in HBASE-27264/HBASE-27486, where we added pluggable "predicate" implementations for deciding block size boundaries. The default still uses the uncompressed raw size, but we also added the PreviousBlockCompressionRatePredicator, which tracks the compression ratio of previously closed blocks to decide an ideal size limit for blocks after compression is applied (a sketch of this idea follows at the end of this subsection). This optimization brought down the total number of blocks, with the compressed size now much closer to the default limit of 64KB. The snippets below show HFilePrettyPrinter output for a given block with the PreviousBlockCompressionRatePredicator off and on:

1) Sample block size when not using the PreviousBlockCompressionRatePredicator: onDiskSizeWithoutHeader=6613
2) Sample block size when using the PreviousBlockCompressionRatePredicator: onDiskSizeWithoutHeader=54299

With the PreviousBlockCompressionRatePredicator, block sizes now approached the configured default, and the number of blocks on each RegionServer was similar to what we had in our tests. As expected, that brought down heap usage, and HBase became much more stable.
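The sketch below illustrates the idea behind the compression-rate predicate; it is a simplified, hypothetical rendition rather than the actual HBase class, with names of our own choosing:

```java
/**
 * Illustrative sketch (not the real HBase implementation): track the
 * compression ratio observed on previously closed blocks and let the writer
 * keep appending cells until the *projected* on-disk size reaches the limit.
 */
public class CompressionRatePredicateSketch {

  private static final int BLOCK_SIZE_LIMIT = 64 * 1024; // 64KB default

  // Ratio of on-disk (compressed) size to raw size. Seeded at 1.0 (no
  // compression) until at least one block has been closed and measured.
  private double compressionRate = 1.0;

  /** Called as cells are appended: should the current block be closed now? */
  public boolean shouldFinishBlock(int uncompressedSizeSoFar) {
    // Project what this block would occupy on disk, given the observed ratio.
    double projectedOnDiskSize = uncompressedSizeSoFar * compressionRate;
    return projectedOnDiskSize >= BLOCK_SIZE_LIMIT;
  }

  /** Called after a block is compressed and serialized, to refine the ratio. */
  public void updateLatestBlockSizes(int uncompressedSize, int onDiskSize) {
    compressionRate = (double) onDiskSize / uncompressedSize;
  }
}
```

With a 90% compression rate, for instance, a block would only close once its raw size approached 640KB, landing near the intended 64KB on disk.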
Eager Cache Loading

As caching is essential to keeping optimal performance for COD over cloud storage, we need to make sure data is cached before client reads request it. This requires prefetching the whole dataset into the cache upfront, and also making sure newly ingested data gets cached. HBase already provided built-in features for these requirements; we just needed to enable them by default in COD's configuration template (a combined hbase-site.xml sketch follows this list):

- hbase.rs.cachecompactedblocksonwrite: set to true so that compaction processes eagerly place their newly created blocks into the BucketCache.
- hbase.rs.prefetchblocksonopen and cloudera.hbase.rs.scan.prefetchblocksonopen: set to false and true, respectively, to enable the "cache prefetch" process. This is a background thread that runs whenever a reader is opened for a given hfile (for example, when a region is opened), reading and caching each block of the file.
- hbase.rs.cacheblocksonwrite: set to true so that flushes eagerly cache blocks of type DATA (see the explanation of block types above) for newly ingested data.
- hfile.block.index.cacheonwrite: set to true so that flushes eagerly cache blocks of type INDEX (a META subtype) for newly ingested data.
- hfile.block.bloom.cacheonwrite: set to true so that flushes eagerly cache blocks of type BLOOM (a META subtype) for newly ingested data.
- hbase.block.data.cachecompressed: set to true to skip decompression of blocks during the prefetch process, so that DATA blocks are cached compressed (when compression is in use). META blocks are always cached uncompressed.
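Putting these defaults together, the template amounts to something like the hbase-site.xml fragment below, using exactly the property names and values listed above:

```xml
<!-- Eager cache loading defaults, as described in the list above -->
<property><name>hbase.rs.cachecompactedblocksonwrite</name><value>true</value></property>
<property><name>hbase.rs.prefetchblocksonopen</name><value>false</value></property>
<property><name>cloudera.hbase.rs.scan.prefetchblocksonopen</name><value>true</value></property>
<property><name>hbase.rs.cacheblocksonwrite</name><value>true</value></property>
<property><name>hfile.block.index.cacheonwrite</name><value>true</value></property>
<property><name>hfile.block.bloom.cacheonwrite</name><value>true</value></property>
<property><name>hbase.block.data.cachecompressed</name><value>true</value></property>
```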
However, the above settings are only effective if there's space available in the cache to accommodate the blocks. As the dataset can grow over time, COD has also implemented autoscaling based on cache usage. This is calculated with the following function:

N = F / (C - (T * R))

Where:
- N: the total number of worker nodes (RegionServers) we should have for optimal cache usage
- F: the current total store file size (the dataset size)
- C: the configured cache size on each RegionServer (in COD, 1.6TB)
- T: the maximum number of concurrent compaction threads
- R: the maximum region size

As an illustrative example, with F = 50TB of store files, C = 1.6TB, T = 2 compaction threads, and R = 20GB (0.02TB), the formula gives N = 50 / (1.6 - 2 × 0.02) ≈ 32.1, so the cluster should run at least 33 RegionServers. COD's autoscale monitoring periodically checks the total store file size metric and applies it to this function. If the optimal number of nodes ("N" in the formula) is larger than the cluster's current number of RegionServers, it triggers a scale-up event.

Cache Resilience

The original BucketCache implementation is always volatile. Even when deployed over a local file, the cache is still reset if the RegionServer restarts, because the in-memory index of the blocks in the cache file is completely lost once the RegionServer JVM process terminates. This is a problem when we aim to cache the whole dataset: even with the prefetch solution described above, loading very large datasets can take a couple of hours, making any maintenance operation that requires a restart a bigger challenge. That is now sorted by the Persistent BucketCache feature, implemented by the Cloudera team and contributed back to the Apache HBase project via HBASE-27686/HBASE-27743.

Persistent BucketCache is enabled by specifying a local file path as the value of the "hbase.bucketcache.persistent.path" property in hbase-site.xml; the feature is disabled by default. Once enabled, a background thread runs in the RegionServers, periodically checking for changes in the cache state and serializing the in-memory cache index to the file specified by "hbase.bucketcache.persistent.path". On RegionServer startup, the persistent cache file is read and deserialized. To guarantee that the actual cache is consistent with the recovered index, the following verifications are performed:

- Cache file checksum: when the persister thread writes the cache index to disk, it calculates a checksum based on the cache file path, size, and last modification time. This checksum is saved within the index. When we retrieve the index on a restart, the checksum is calculated again and compared against the saved one. If the cache wasn't modified after we last saved the index, the checksums match and we are good to go. Otherwise, the index is not up to date with the cache and might contain pointers to blocks that are no longer there, so we need to perform the extra validation below.
- Block cached time: whenever we write a block to the cache, we record the caching time as the first eight bytes at the block's cache offset. This information is also kept in the index saved to the persistent cache file. If the checksum verification fails, we iterate through the whole index and compare each block's cached time in the index against the corresponding cached time in the cache itself; if they differ, we remove the block from the index. This is a costly operation for a 1.6TB cache and needs to run in the background, or RegionServer startup could be delayed for a while. And since we allow the RegionServer to come online before the verification completes, we also perform the cached-time check on individual cache reads, in case a client request reaches a block in the index that the background thread has not checked yet.

Region Balancing

On an active cluster, data growth is expected, and HBase keeps resharding tables by creating new regions. To keep region distribution even across all RegionServers in the cluster, HBase's region balancer periodically monitors the regions-per-server rate, read/write load, and store file size, among other metrics, moving regions between RegionServers to keep the cluster balanced. Whilst ensuring servers don't get overloaded (hotspotted) is critical for optimal performance, moving regions to new servers and temporarily losing the cached data can have an equally significant impact. Disabling the balancer is an option, but it adds extra operational cost, as it requires constant monitoring of region distribution and custom or manual region-moving processes. To address that, the Cloudera team designed a new region balancer implementation that is aware of cache allocation on RegionServers when defining a new assignment plan. This has been contributed back to the Apache HBase project as the CacheAwareLoadBalancer in HBASE-27389 and has been detailed in this previous article.
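For reference, the active balancer implementation is selected through the hbase.master.loadbalancer.class property. Assuming the HBASE-27389 class lives in the usual balancer package (an assumption on our part; verify the fully qualified name against your HBase release), enabling it alongside cache persistence would look something like:

```xml
<!-- Assumed class name for the HBASE-27389 balancer; verify against
     the HBase release you are running. -->
<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>org.apache.hadoop.hbase.master.balancer.CacheAwareLoadBalancer</value>
</property>
<!-- The cache-aware balancer builds on the persistent cache index
     described above; example path only. -->
<property>
  <name>hbase.bucketcache.persistent.path</name>
  <value>/mnt/nvme0/bucketcache.persistence</value>
</property>
```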
Conclusion

The use of an HBase BucketCache backed by a file on the ephemeral storage volumes commonly offered by all major cloud providers has shown good performance results when the cache allocation is optimal. There's still the caveat of having to plan for a cache warm-up period when the cluster is first initialized. To minimize this, we have made the cache resilient to restarts, so cache loading is not needed during maintenance restarts or even after random RegionServer crashes. Similarly, the cache-aware balancer reduces the impact of region moves by taking into account how much of a region is cached on a RegionServer when deciding where it should be opened. With the whole dataset cached locally, client requests get better performance by avoiding the latency overhead of cloud storage access.