I am using the AWS Quick Start to install a cluster. I would like to use EBS storage attached to my nodes to create persistent data storage across instance restarts. This is an education cluster and I do not want to leave it running all the time, but I also do not want to lose my data in HDFS when I stop the instances.
Is this supported?
As of Director 2.1, Director cannot attach EBS volumes to instances and hence cannot set up clusters that use EBS for persistent storage. EBS support is on Director's roadmap, but until then, have you considered storing your data in S3 and having your cluster read/write directly from S3?
Thank you for a prompt reply. Can I connect the EBS myself using AWS? I can then just stop and start the nodes as usual and they would have the EBS already attached.
I will also look into using S3.
My recommendation would be to write a bootstrap script that mounts the EBS volumes. Include this bootstrap script in each instance group, so that the disks are available before the CDH services are initialized. If the volumes are mounted after the services are set up, you would have to reconfigure and reinitialize a bunch of CDH services. Do note that you are going off the well-trodden path, and that leveraging S3 for now will probably save you some grief.
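To make the idea concrete, here is a minimal sketch of what such a bootstrap script might look like, not a tested Director recipe. It assumes the EBS volumes have already been attached to the instance by some other means, that the instance has a Python 3 interpreter, and that the script runs as root. The device names and mount points in `VOLUMES` are hypothetical placeholders. The important detail is that a volume is formatted only when `blkid` finds no filesystem on it, so existing data survives a stop/start.

```python
#!/usr/bin/env python3
"""Sketch: mount already-attached EBS volumes before CDH services start."""
import os
import subprocess

# Hypothetical device-to-mount-point mapping; adjust to the device names
# your instances actually expose (they differ by instance type and AMI).
VOLUMES = {
    "/dev/xvdf": "/data0",
    "/dev/xvdg": "/data1",
}
FSTAB = "/etc/fstab"


def has_filesystem(device, run=subprocess.run):
    """Return True if blkid identifies an existing filesystem on the device."""
    result = run(["blkid", device],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def plan_commands(device, mount_point, formatted):
    """Build the commands for one volume, formatting only a blank volume."""
    commands = []
    if not formatted:
        commands.append(["mkfs.ext4", "-q", device])
    commands.append(["mkdir", "-p", mount_point])
    commands.append(["mount", "-o", "noatime", device, mount_point])
    return commands


def fstab_line(device, mount_point):
    """Return an fstab entry; nofail keeps boot going if the volume is absent."""
    return f"{device} {mount_point} ext4 defaults,noatime,nofail 0 2"


def ensure_fstab_entry(device, mount_point, path=FSTAB):
    """Append the fstab entry unless an identical line is already present."""
    line = fstab_line(device, mount_point)
    with open(path, "r+") as fstab:
        existing = fstab.read()
        if line not in existing.splitlines():
            if existing and not existing.endswith("\n"):
                fstab.write("\n")
            fstab.write(line + "\n")


def main():
    for device, mount_point in VOLUMES.items():
        if not os.path.exists(device):
            print(f"{device} is not attached, skipping")
            continue
        if os.path.ismount(mount_point):
            print(f"{mount_point} is already mounted, skipping")
            continue
        formatted = has_filesystem(device)
        for command in plan_commands(device, mount_point, formatted):
            subprocess.run(command, check=True)
        ensure_fstab_entry(device, mount_point)


if __name__ == "__main__":
    main()
```

The script only mounts the disks. The DataNode data directories (`dfs.datanode.data.dir`) would still need to point at these mount points for HDFS to use them. Device names can also change between boots on some instance types, so referring to volumes by filesystem UUID in `/etc/fstab` may be more robust than the device paths used here.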
Why is this off the beaten track? As I understand it, S3 is not really suitable for creating an HDFS filesystem platform. I need volumes and filesystems attached to the nodes (as you would do with nodes and JBODs in an onsite installation) so that the data persists between cluster restarts. Otherwise, surely the whole concept of using AWS for a cluster is a poor solution? If you have to back up all the data stored on ephemeral storage before you stop the instances and then restore it once they are restarted, then that is a very clumsy solution. Or am I misunderstanding a fundamental concept?
The suggestion here is not to use S3 to populate HDFS - as you have gathered, that would involve moving data back and forth. It isn't necessary for your use-case, though some folks do indeed do just that if their raw data is in S3 and they need to use HDFS for latency requirements.
Rather, the suggestion is to have your processing jobs (MR, Impala, etc.) run directly off of S3. See this blog post as an example: http://blog.cloudera.com/blog/2016/08/analytics-and-bi-on-amazon-s3-with-apache-impala-incubating/
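As a rough illustration of what "running directly off S3" means for Impala, the sketch below builds a `CREATE EXTERNAL TABLE` statement whose `LOCATION` is an `s3a://` path. The table name, columns, bucket, and prefix are all hypothetical, and the helper function is only for illustration. The data files stay in the bucket, so stopping the cluster does not affect them.

```python
def s3_external_table_ddl(table, columns, bucket, prefix, file_format="PARQUET"):
    """Build Impala DDL for an external table whose data lives in S3."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    location = f"s3a://{bucket}/{prefix.strip('/')}/"
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"STORED AS {file_format} LOCATION '{location}'")


# Hypothetical table and bucket, for illustration only.
ddl = s3_external_table_ddl(
    "events",
    [("id", "BIGINT"), ("payload", "STRING")],
    bucket="my-example-bucket",
    prefix="warehouse/events",
)
print(ddl)
```

The resulting statement could be run through `impala-shell`, provided the cluster has been configured with credentials that can read the bucket.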
The "off the beaten track" remark referred to using the bootstrap script to mount EBS volumes. This extends Director's existing functionality; EBS support should come out of the box soon. Until then, I'm not aware of anyone else who has written such a bootstrap script. Hence the warning.
Hope this helps.