Reading the Reference Architecture for AWS Deployments, one message is very clear: use EC2 instance storage for data paths.
In detail, which paths absolutely need this kind of storage? Only the HDFS DataNode storage paths (dfs.data.dir, dfs.datanode.data.dir), or something else? Our EC2 instances are EBS-backed, so software, logs, etc. will be stored on EBS. The worker nodes also have ephemeral storage devices, each mounted one-to-one on its own filesystem.
I don't understand why EC2 instances with on-board instance storage are suggested for management nodes. Do other storage paths, such as the Hadoop NameNode data path, also need to be stored on ephemeral storage?
EC2 ephemeral instance storage provides the highest throughput for DFS. Historically, that made it the clear choice for most Hadoop users.
With Amazon's release of new EBS volume types (ST1 and SC1), the AWS Reference Architecture has been updated to allow for DFS on EBS. Specifically, the DataNode data directories can be placed on ST1 or SC1 EBS volumes. The new volumes don't provide the same throughput as ephemeral storage, but they are suitable for many workloads. In addition, using DFS on EBS allows for a variety of EC2 instance types with varying performance characteristics and pricing.
For now, we continue to recommend storing DFS NameNode data on EC2 ephemeral storage.
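
As a rough illustration of that layout, the sketch below renders the two hdfs-site.xml properties involved: dfs.datanode.data.dir for the DataNode directories and dfs.namenode.name.dir for the NameNode metadata. The mount points (/data0, /data1, /mnt/ephemeral0), the dfs/dn and dfs/nn subdirectories, and the helper name hdfs_property are hypothetical and only there for the example; substitute whatever your instances actually mount.

```python
from xml.sax.saxutils import escape


def hdfs_property(name, directories):
    """Render one hdfs-site.xml <property> with a comma-separated directory list."""
    value = ",".join(directories)
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


# Hypothetical mount points, one filesystem per device.
EBS_MOUNTS = ["/data0", "/data1"]        # assumed ST1/SC1 EBS volumes on a worker node
EPHEMERAL_MOUNTS = ["/mnt/ephemeral0"]   # assumed instance-store mount on a management node

# One data directory per mounted volume.
datanode_dirs = [f"{mount}/dfs/dn" for mount in EBS_MOUNTS]
namenode_dirs = [f"{mount}/dfs/nn" for mount in EPHEMERAL_MOUNTS]

print(hdfs_property("dfs.datanode.data.dir", datanode_dirs))
print(hdfs_property("dfs.namenode.name.dir", namenode_dirs))
```

Listing one directory per mounted device mirrors the one-to-one device-to-filesystem mapping mentioned in the question. On older Hadoop releases the same values would go under the deprecated names dfs.data.dir and dfs.name.dir.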