
Architecting a CDH cluster on AWS for multiple PBs of data

Explorer

Hello,

I have some questions about deploying CDH on AWS.

I've read the reference architecture doc and other material on the Cloudera Engineering Blog, but I need some more insight.

1) Is CDH deployment limited to certain instance types, or can I deploy it on all AWS instance types?

 

2) Assume I want to create a cluster that will be active 24x7. I understand that for a long-running cluster it's better to use local-storage instances.

For a 2 PB cluster, I think d2.8xlarge should be the best choice for the data nodes.
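
For reference, the rough sizing math behind that choice (a sketch: it assumes d2.8xlarge's 24 x 2 TB local HDDs, HDFS's default 3x replication, and an illustrative 25% headroom for temp and non-HDFS data):

# Rough HDFS sizing sketch for a 2 PB cluster on d2.8xlarge.
# The 25% headroom figure is an assumption for illustration.
import math

usable_pb = 2.0              # target usable capacity, PB
replication = 3              # HDFS default replication factor
raw_per_node_tb = 24 * 2     # d2.8xlarge: 24 x 2 TB HDDs = 48 TB raw
headroom = 0.25              # reserved for temp/non-HDFS data

raw_needed_tb = usable_pb * 1024 * replication / (1 - headroom)
nodes = math.ceil(raw_needed_tb / raw_per_node_tb)
print(f"Raw capacity needed: {raw_needed_tb:.0f} TB")   # 8192 TB
print(f"d2.8xlarge data nodes: {nodes}")                # 171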

About the Master Nodes:

- If I deploy only 3 master nodes, is it better to have them on local-storage instances too, or on EBS-backed instances so I can react quickly to a possible master node failure?

- Are there any best practices for the master node instance type (EBS or local storage)?

 About the Data Nodes:

- If a data node fails, does CDH have an automated mechanism to spin up a new instance and join it to the cluster, restoring the cluster without downtime? Or do we have to build a script from scratch for this?

About the Edge Nodes:

- Are there any best practices for the instance type (EBS or local storage)?

 

 

3) If I want to back up the cluster to S3:

- When I do a distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3?

If the data has compression applied (e.g. Snappy, gzip, etc.) and I do a distcp to S3:

- Is the space occupied on S3 the same, or does the distcp command decompress the data during the copy?

 

If I have a cluster based on EBS-backed instances:

- Is it possible to snapshot the disks and bring a data node back up from the snapshots?

 

4) If the data nodes are deployed as r4.8xlarge and I need more horsepower, is it possible to scale up the cluster from r4.8xlarge to r4.16xlarge on the fly, detaching and re-attaching the disks in a few minutes?

 

Thanks a lot for the clarifications. I hope my questions will also help other users. 🙂

 

Best Regards,

Luca

1 ACCEPTED SOLUTION

Rising Star

Hi Luca, I'm not sure if your questions are directed towards using Cloudera Director to deploy CDH or just general CDH deployment on AWS. Some of these answers are tailored towards using Director to deploy CDH on AWS.

 

1) Is CDH deployment limited to certain instance types, or can I deploy it on all AWS instance types?

 

Most AWS instance types should work, but be sure to choose instance types with enough compute and memory for the number of services being deployed; otherwise you will run into health warnings/errors in Cloudera Manager and things may not work as expected. The reference architecture for AWS deployments, which you may have already looked at, makes some recommendations on choosing instances for master, worker, and edge nodes.

 

2) Assume I want to create a cluster that will be active 24x7. I understand that for a long-running cluster it's better to use local-storage instances.

For a 2 PB cluster, I think d2.8xlarge should be the best choice for the data nodes.

About the Master Nodes:

- If I deploy only 3 master nodes, is it better to have them on local-storage instances too, or on EBS-backed instances so I can react quickly to a possible master node failure?

 

The main purpose of EBS volumes in Director is to give a wider range of storage types (gp2, st1, sc1) and to allow pausing the cluster for cost-saving purposes. If reacting quickly to master node failures is a priority, the cluster should be set up with High Availability.

 

- Are there any best practices for the master node instance type (EBS or local storage)?

 

Both should be viable; no additional recommendations aside from what's in the reference architecture.

 

About the Data Nodes:

- If a data node fails, does CDH have an automated mechanism to spin up a new instance and join it to the cluster, restoring the cluster without downtime? Or do we have to build a script from scratch for this?

 

To clarify, CDH itself isn't capable of spinning up a new instance on AWS, but Director is. If an instance that hosts a data node fails, Director will not automatically spin up a new instance; the user can go through the Director UI and choose the repair option for the failing instance. This will provision a new data node instance in its place and add it to the CDH cluster. Repair can also be done through the Director API, so this can be automated if needed.
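
If you do automate it, driving repair from a monitoring hook could look roughly like the following. This is a minimal sketch: the endpoint path, API version, and JSON body are illustrative assumptions, not the documented Director API, so check your Director version's API reference for the actual signature.

# Hypothetical sketch: trigger Director's repair for a failed instance.
# The URL path and payload below are assumptions for illustration only.
import requests

DIRECTOR = "http://director.example.com:7189"  # placeholder Director server
AUTH = ("admin", "admin")                      # placeholder credentials

def repair_instance(env, deployment, cluster, instance_id):
    """Ask Director to replace a failed cluster instance (hypothetical endpoint)."""
    url = (f"{DIRECTOR}/api/v12/environments/{env}"
           f"/deployments/{deployment}/clusters/{cluster}/repair")
    resp = requests.post(url, json={"instanceIds": [instance_id]}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

# e.g. wire this to a monitoring alert on instance health:
# repair_instance("aws-env", "cdh-deployment", "cluster-1", "i-0abc123def456")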

 

About the Edge Nodes:

- Are there any best practices for the instance type (EBS or local storage)?

 

Both should be viable; no additional recommendations aside from what's in the reference architecture.

 

3) If I want to back up the cluster to S3:

- When I do a distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3?

 

I don't think distcp supports Glacier as a destination, but you should be able to use S3 lifecycle policies to transition objects from S3 to Glacier some number of days after they are created. So distcp can't go directly to Glacier, but a simple data flow through S3 is possible.
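
For example, a minimal sketch of that flow with boto3, assuming the backup has already landed in S3 via distcp (the bucket name and prefix are placeholders):

# Sketch: transition distcp'd backup objects to Glacier one day after creation.
# "my-backup-bucket" and "backup/" are placeholders for your backup layout.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "backup-to-glacier",
            "Filter": {"Prefix": "backup/"},  # only the objects written by distcp
            "Status": "Enabled",
            "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
        }]
    },
)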

 

If the data has compression applied (e.g. Snappy, gzip, etc.) and I do a distcp to S3:

- Is the space occupied on S3 the same, or does the distcp command decompress the data during the copy?

 

I don't think distcp will decompress the data; it copies files byte-for-byte, so the space occupied on S3 should match the compressed size on HDFS.
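
If you want to verify that on your own data, a quick check (a sketch; the HDFS path, bucket, and prefix are placeholders) is to compare total bytes on both sides after the copy:

# Sketch: confirm distcp copied compressed bytes unchanged by comparing totals.
import subprocess
import boto3

# "hdfs dfs -du -s" prints the logical (un-replicated) size first.
out = subprocess.check_output(
    ["hdfs", "dfs", "-du", "-s", "/data/compressed"], text=True)
hdfs_bytes = int(out.split()[0])

# Sum the sizes of the copied objects on S3.
s3 = boto3.client("s3")
s3_bytes = 0
for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket="my-backup-bucket", Prefix="backup/"):
    s3_bytes += sum(obj["Size"] for obj in page.get("Contents", []))

print(f"HDFS: {hdfs_bytes} bytes  S3: {s3_bytes} bytes")  # should match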

 

If I have a cluster based on EBS-backed instances:

- Is it possible to snapshot the disks and bring a data node back up from the snapshots?

 

This workflow is currently not supported.

 

4) If the data nodes are deployed as r4.8xlarge and I need more horsepower, is it possible to scale up the cluster from r4.8xlarge to r4.16xlarge on the fly, detaching and re-attaching the disks in a few minutes?

 

This workflow is also currently not supported.



Explorer

Precise and clear!

Thank you very much for your clarifications, aarman!

 

 

Regards,

Luca