
What are file system options for HBase deployment?

Contributor

I am aware that HDFS is the common filesystem for HBase deployments, but I would like to understand the other filesystem options and possible drawbacks.

1 ACCEPTED SOLUTION

Super Guru

@Boris Demerov

The most common filesystem used with HBase is HDFS, but you are not locked into it: the FileSystem abstraction HBase relies on has a pluggable architecture, so HDFS can be replaced with any other supported implementation. In fact, you could go as far as implementing your own filesystem, perhaps even on top of another database.

The local filesystem bypasses Hadoop entirely; you do not need an HDFS cluster, or any other cluster, at all. Everything is handled by the FileSystem class HBase uses to connect to the filesystem implementation: the supplied ChecksumFileSystem class is loaded by the client and stores all data under local disk paths. The beauty of this approach is that HBase is unaware it is not talking to a distributed filesystem on a remote or colocated cluster, but is using the local filesystem directly. The standalone mode of HBase uses this feature to run HBase only, without HDFS.
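To make the pluggability concrete, here is a minimal hbase-site.xml sketch: the URI scheme of hbase.rootdir is what selects the FileSystem implementation. The hostnames, bucket names, and paths below are hypothetical examples, not values from this thread:

<!-- The scheme of hbase.rootdir picks the FileSystem implementation. -->
<property>
  <name>hbase.rootdir</name>
  <!-- HDFS:       hdfs://namenode.example.com:8020/hbase -->
  <!-- Local disk: file:///home/someuser/hbase (standalone mode) -->
  <!-- Amazon S3:  s3a://my-bucket/hbase -->
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>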

The S3 FileSystem implementation provided by Hadoop supports three different modes: the raw (or native) mode, the block-based mode, and the newer AWS SDK based mode.

The raw mode uses the s3n: URI scheme and writes the data directly into S3, similar to the local filesystem; you can see all the files in your bucket the same way you would on your local disk. The s3: scheme is the block-based mode and was used to overcome S3's former maximum file size limit of 5 GB. That limit has since been lifted, which makes the choice harder in theory but simple in practice: opt for s3n: if you are not going to exceed 5 GB per file. The block mode emulates the HDFS filesystem on top of S3, which makes browsing the bucket content more difficult: only the internal block files are visible, and the HBase storage files are stored arbitrarily inside these blocks and strewn across them. Both of these filesystems rely on the external JetS3t open source Java toolkit to do the actual heavy lifting.

A more recent addition is the s3a: scheme, which replaces the JetS3t block mode with an AWS SDK based implementation. It is closer to the native S3 API, can optimize certain operations for noticeable speedups, and integrates better overall than the existing implementations.
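As an illustration of the s3a: option, a hypothetical hbase-site.xml fragment could look like the following. The bucket name and credentials are placeholders; the fs.s3a.* properties belong to Hadoop's s3a connector:

<property>
  <name>hbase.rootdir</name>
  <value>s3a://my-hbase-bucket/hbase</value> <!-- placeholder bucket -->
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value> <!-- placeholder credential -->
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value> <!-- placeholder credential -->
</property>

Note the caveat about S3 sync durability raised further down in this thread before running production HBase this way.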

There are other filesystems as well. One worth mentioning is QFS, the Quantcast File System: an open source, distributed, high-performance filesystem written in C++, with features similar to HDFS. You can find more information about it on the Quantcast website. Further examples are the Azure filesystem and the Swift filesystem, which use the native APIs of Microsoft Azure Blob Storage and OpenStack Swift, respectively, allowing Hadoop to store data in those systems.
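For completeness, the Azure and Swift variants follow the same rootdir pattern. A sketch, with hypothetical container, account, and provider names:

<property>
  <name>hbase.rootdir</name>
  <!-- Swift equivalent (hypothetical): swift://mycontainer.myprovider/hbase -->
  <value>wasb://mycontainer@myaccount.blob.core.windows.net/hbase</value>
</property>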

Please carefully evaluate what you need for your specific use case. Note, however, that the majority of production clusters today are based on HDFS, and I will not reiterate its advantages here. The primary reasons HDFS is so popular are its built-in replication, fault tolerance, and scalability. Any alternative filesystem should provide the same guarantees, because HBase implicitly assumes that the filesystem implementation stores data reliably: HBase has no means of its own to replicate data or maintain copies of its storage files. That functionality must be provided by the filesystem.


7 REPLIES


@Boris Demerov

Apart from HDFS, HBase users can also use the local filesystem by setting "hbase.rootdir" to point to a local path in "hbase-site.xml".

Example: http://hbase.apache.org/0.94/book/quickstart.html

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/someuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/someuser/zookeeper</value>
  </property>
</configuration>



Explorer

Thanks, @Constantin Stanca and @jss.

Super Guru

Beware that S3 presently does not support the level of "sync durability" that is required by HBase write-ahead log files. This means that there would likely be data loss if you tried to run HBase fully on S3.
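A sketch of one common mitigation, assuming a recent HBase release (the hbase.wal.dir property is not available in older versions), is to keep the write-ahead logs on HDFS while the table data lives on S3. The bucket and namenode below are hypothetical:

<property>
  <name>hbase.rootdir</name>
  <value>s3a://my-hbase-bucket/hbase</value> <!-- hypothetical bucket -->
</property>
<property>
  <name>hbase.wal.dir</name>
  <!-- WALs stay on a filesystem with proper sync durability -->
  <value>hdfs://namenode.example.com:8020/hbase-wal</value>
</property>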

Contributor

Thanks all! Appreciate the detail.

Guru

HBase in HDP is certified to run on top of HDFS, WASB/ADLS (using Azure HDInsight) and EMC Isilon.

Expert Contributor

I'd like to add a point of clarity for Dell EMC Isilon.

In Isilon's OneFS operating system we have implemented the server (NameNode, DataNode) half of HDFS. A solution with Isilon still requires client systems (e.g., CentOS) that have the HDFS client installed. HBase would be installed on those systems alongside the HDFS client.
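In other words, from HBase's point of view an Isilon-backed cluster still looks like HDFS: hbase.rootdir simply points at the HDFS endpoint served by OneFS. A sketch, with a hypothetical hostname and the standard NameNode RPC port:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://isilon.example.com:8020/hbase</value> <!-- hypothetical endpoint -->
</property>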

HBase cannot be installed on OneFS and run in standalone mode. It could be installed in standalone mode on another system that used a mount or other path to store data on OneFS. But that would not be the same as running HBase on OneFS.