AWS S3 bucket as a primary storage for HDFS

cjervis · 10-16-2016

Hi Guys,

I am looking for information about using an S3 bucket as the primary storage for HDFS. Has anyone done something like that? What are the pros and cons of such a solution?

Thanks,

Andrzej

Harsh J · 10-16-2016

Do you mean "use of S3 instead of HDFS"? That would be a good idea for some cloud-env clusters, such as those Cloudera Director helps run.

Keep an eye out for our upcoming 5.9 release too, where several further Cloud environment enhancements (incl. better S3 support) are forthcoming.

andrzej_jedrzej · 10-16-2016

I would like to ingest all my data into S3 and make it the primary storage layer (not a backup). It would be a cloud-based environment, e.g. deployed with Cloudera Director. Is it possible to specify the type of storage during deployment?

I would like to run YARN, Spark, and Oozie jobs.

weichiu · 10-16-2016

Hi,

I don't think that's possible, given that most applications rely on HDFS semantics (strong consistency, POSIX-like file system behavior), and S3 simply isn't designed as a file system (eventual consistency, blob store). Plus, you lose data locality.
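To make the "blob store" point concrete, here is a toy sketch in Python. It is not S3 or HDFS client code: the object store is modeled as a flat dictionary of keys, so "renaming a directory" turns into one copy and one delete per object instead of a single metadata operation. The class and method names are hypothetical and exist only for this illustration.

```python
class ToyObjectStore:
    """Flat key -> bytes map; 'directories' are only shared key prefixes."""

    def __init__(self):
        self.objects = {}
        self.operations = 0  # counts per-object copy and delete requests

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Emulate a directory rename by copying, then deleting, each object."""
        # sorted() builds a list first, so mutating the dict below is safe.
        for key in sorted(k for k in self.objects if k.startswith(src)):
            self.objects[dst + key[len(src):]] = self.objects[key]  # copy
            self.operations += 1
            del self.objects[key]  # delete
            self.operations += 1


store = ToyObjectStore()
for i in range(3):
    store.put(f"tmp/job1/part-{i}", b"data")

store.rename_prefix("tmp/job1/", "out/job1/")
print(sorted(store.objects))
print(store.operations)
```

In this model, a job that commits its output by renaming a temporary directory pays a cost proportional to the number of objects, and a reader looking at the store midway can see the half-moved state. On HDFS the same rename is a single NameNode metadata change.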

As far as I know, most cloud use cases still use HDFS as temporary, intermediate storage, and use S3 as permanent, eventual storage.
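The "HDFS for intermediate data, S3 for the final copy" pattern usually ends with a copy step. Below is a minimal sketch, assuming the `hadoop` CLI with DistCp and the S3A connector is available on the cluster. The NameNode address, bucket name, and paths are made up for illustration, and the helper functions are hypothetical. By default it only builds the command and does not run it.

```python
import subprocess


def build_distcp_command(hdfs_path, s3a_path):
    """Return the argv for copying an HDFS directory to S3 with DistCp."""
    if not hdfs_path.startswith("hdfs://"):
        raise ValueError("source must be an hdfs:// URI")
    if not s3a_path.startswith("s3a://"):
        raise ValueError("destination must be an s3a:// URI")
    return ["hadoop", "distcp", hdfs_path, s3a_path]


def publish(hdfs_path, s3a_path, dry_run=True):
    """Build the DistCp command; execute it only when dry_run is False."""
    cmd = build_distcp_command(hdfs_path, s3a_path)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if DistCp exits non-zero
    return cmd


# Hypothetical locations, for illustration only.
cmd = publish(
    "hdfs://namenode:8020/tmp/job1/output",
    "s3a://example-bucket/warehouse/job1",
)
print(" ".join(cmd))
```

Jobs keep reading and writing their working data on HDFS, and only the finished output is pushed to the bucket, so the slower object-store operations are kept out of the hot path.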

There have been several studies on using HDFS as the metadata store and cloud storage as the data store, but that is a large effort (see HDFS-9806) and is probably in the Hadoop 4/CDH 7 timeframe.

Hope this helps.
