Why can't Object Stores like Amazon S3 be used as the fs.defaultFS?

New Contributor

I've read https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_cloud-data-access/content/intro.html, which recommends against using a cloud storage connector in place of HDFS as the default filesystem. Can someone explain why these object stores can't be set as fs.defaultFS, and which services wouldn't work or would have issues?

1 ACCEPTED SOLUTION

Super Guru

Blob stores do not have the same semantics as file systems. HBase relies on very specific semantics with respect to concurrency and atomic operations which most blob stores (including S3) do not provide.

One example: a move of some "directory" in an S3 bucket is not atomic whereas this is atomic in HDFS.
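To make the rename point concrete, here is a minimal sketch of why a "directory" move on a blob store is not atomic. It assumes the move is emulated as one copy plus one delete per object, which is my understanding of how connectors such as S3A handle it. ToyObjectStore, rename_prefix, fail_after, and the key names are all hypothetical, made up for illustration; this is an in-memory toy, not a real S3 client.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat blob store (hypothetical, not a real S3 API)."""

    def __init__(self):
        # Full key -> bytes. There are no real directories, only key prefixes.
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # Returns a new sorted list, so callers may mutate the store while iterating.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src, dst, fail_after=None):
        """Emulate a directory 'move' as one copy + delete per object.

        fail_after (an assumption for this demo) simulates a client crash
        once that many objects have been moved.
        """
        moved = 0
        for key in self.list_prefix(src):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("simulated client crash mid-rename")
            new_key = dst + key[len(src):]
            self.objects[new_key] = self.objects[key]  # copy to the new key
            del self.objects[key]                      # delete the old key
            moved += 1
        return moved


def demo_partial_rename():
    """Crash after one object has moved, then report what each prefix holds."""
    store = ToyObjectStore()
    for i in range(3):
        store.put(f"tmp/region/file{i}", b"data")
    try:
        store.rename_prefix("tmp/region/", "data/region/", fail_after=1)
    except RuntimeError:
        pass  # the "move" is now half done
    return store.list_prefix("tmp/region/"), store.list_prefix("data/region/")
```

After the simulated crash, one file sits under the destination prefix and two remain under the source, so any reader sees a half-moved directory. On HDFS the same rename is a single metadata operation in the NameNode, so readers see either the old path or the new one.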

HBase will 100% not work correctly if you configure hbase.rootdir to use S3 via Hadoop's S3A connector. EMR has proprietary code in its S3 filesystem access layer, separate from S3A, which somehow does not suffer from this issue.

3 REPLIES

New Contributor

Thanks @Josh Elser for your response. I did notice the HBase Master failing to stay up when the cluster was using a blob store (Amazon S3 and DellEMC's ECS) as the default filesystem, which might be because HBase needs HDFS to replicate the WAL. Do you know of other services that would not work in such a use case?

Super Guru

I would start by assuming that no service which relies on HDFS can simply use S3 directly. S3Guard can likely bridge the gap for most systems (HBase is an exception), but I cannot tell you the requirements for every service in existence.