How to set up a cluster in AWS? What type of storage is supported for HDFS? EBS? EMR?

EBS is supported and is recommended mainly for mission-critical data, that is, data that must remain (mostly) available. You can use ephemeral storage, which will be faster, but if the node goes down you won’t be able to restore that data. Since AWS (and other cloud providers) have been known to lose entire regions to outages, you can lose your whole cluster that way.

EBS volumes will be available again when the region comes back online; ephemeral storage won’t. However, EBS is also very pricey, and you may not want to pay for that option.

Another option is to use ephemeral storage and set up backup routines to S3, so you can restore to a point in time. (If you want, you can also use EBS and back up to S3.) The main reason EBS is often not recommended for HDFS is that it is very expensive, but it is supported.
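As a rough illustration of the S3 backup routine described above, here is a minimal Python sketch that builds a `hadoop distcp` command copying an HDFS path to a timestamped S3 prefix, so each run leaves a separate restore point. It assumes the cluster has the S3A connector and credentials configured; the bucket name, the `hdfs-backups` prefix, and the function names are hypothetical and not part of the original article.

```python
import subprocess
from datetime import datetime, timezone


def build_backup_command(hdfs_path, bucket, when=None):
    """Return a `hadoop distcp` argument list copying hdfs_path to a dated S3 prefix.

    hdfs_path is an absolute HDFS path such as "/data/events".
    bucket and the "hdfs-backups" prefix are illustrative names only.
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H%M%SZ")
    # One prefix per run, so every backup is its own point-in-time copy.
    target = f"s3a://{bucket}/hdfs-backups/{stamp}{hdfs_path}"
    return ["hadoop", "distcp", hdfs_path, target]


def run_backup(hdfs_path, bucket):
    """Run the copy; raises CalledProcessError if distcp exits non-zero."""
    subprocess.run(build_backup_command(hdfs_path, bucket), check=True)
```

Scheduling `run_backup` from cron or a workflow scheduler would give periodic restore points; restoring would be the same copy in the opposite direction.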

For HBase workloads you should use i2 instances. Only use d2 nodes for storage-dense workloads (with sequential reads); they give you a lot of locally attached storage, and the throughput is quite good.

Other Storage Tips:

  • hs1.8xlarge for Hadoop with ephemeral storage.
  • i2 for HBase.
  • d2.8xlarge for compute-intensive HBase plus data-intensive storage.
  • EBS is very expensive and does not scale linearly; it depends on how many storage array fabrics you mesh to under the covers.
  • The instance/ephemeral storage (on AWS) would only be used for DataNode HDFS storage, so the loss of an instance is less of a concern. It is also going to get much better performance.
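
The instance-type tips above can be captured as a small lookup, for example in a provisioning script. This is only a sketch of the article's suggestions; the workload keys and the function name are hypothetical labels chosen for illustration.

```python
# Instance suggestions taken from the tips above; the keys are illustrative labels.
RECOMMENDED_INSTANCE = {
    "hadoop_ephemeral": "hs1.8xlarge",
    "hbase": "i2",
    "hbase_compute_and_dense_storage": "d2.8xlarge",
}


def recommend_instance(workload):
    """Return the suggested instance family or size for a workload label."""
    try:
        return RECOMMENDED_INSTANCE[workload]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_INSTANCE))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}") from None
```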
Comments

Thanks for this article. A follow-up question on the ephemeral storage consideration for HDFS. We use Hortonworks cluster nodes (m4.4xlarge). Is it recommended to have a cluster of 10 data nodes, of which 5 are ephemeral and 5 are EBS-backed instances? Assume the 5 ephemeral nodes are backed up to S3. What would the pros and cons be, especially regarding data loss, when there is: 1. a crash of a few ephemeral or EBS-backed instances; 2. an AWS outage in this AZ; 3. a region outage (DR plan in another region with S3 cross-region replication?).