Created 12-21-2016 11:26 AM
I see 3 different options to deploy HDP in the cloud.
In my understanding, options 1 and 2 are managed services where control is limited when it comes to the choice of OS, etc. HDInsight has multiple cluster types (I'm not sure what the rationale behind this is, though).
Questions:
Created 12-21-2016 12:34 PM
Note that there are 3 cloud deployment options for HDP -- you are missing HDC (Hortonworks Data Cloud). Also, your #2 is probably best called HDP IaaS so that it is not confused with managed services.
HDP via Cloudbreak (alternatively via Azure Marketplace for Azure)
This is a full deployment of whatever cluster configurations you want to deploy to AWS, Azure, Google Cloud, or OpenStack. This is pure IaaS. It is like having your full on-prem possibilities in the cloud. Hortonworks' Cloudbreak tool helps with provisioning the IaaS instances and managing them, including autoscaling. Blueprints help you select preconfigured cluster types, but you can use whatever server instances and cluster configurations (number of master nodes and datanodes, how HDP services are distributed across them) you wish. It deploys all HDP services via Ambari, as you would on-prem. You can use HDFS, S3, Blob, etc. as storage, or a combination. It is meant for long-running ("permanent") clusters. For Azure, you can alternatively deploy your full cluster via the Azure Marketplace instead of using Cloudbreak.
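Cluster layout is expressed as an Ambari blueprint. A minimal sketch of what such a blueprint might look like -- the name, stack version, and component placement here are hypothetical, for illustration only:

```json
{
  "Blueprints": {
    "blueprint_name": "hdp-small-cluster",
    "stack_name": "HDP",
    "stack_version": "2.5"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [
        { "name": "NAMENODE" },
        { "name": "RESOURCEMANAGER" },
        { "name": "HIVE_SERVER" }
      ]
    },
    {
      "name": "worker",
      "cardinality": "3",
      "components": [
        { "name": "DATANODE" },
        { "name": "NODEMANAGER" }
      ]
    }
  ]
}
```

Cloudbreak takes a blueprint like this plus your choice of instance types and drives Ambari to lay the services out accordingly.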
HDInsight
This is PaaS, i.e. managed services, on Azure. It includes most, but not all, HDP services (e.g. it has HBase, Storm, and Spark but currently does not have Atlas) and also has R Server on Spark; these are managed as Azure services engineered by Microsoft and Hortonworks to conform to the Azure platform. It has a set of preconfigured cluster types to make spinning up a cluster easy and fast. It can use HDFS, ADLS, or WASB storage. It is meant for both long-running and ephemeral clusters (spun down when work is completed). For ephemeral clusters, data is persisted in ADLS or WASB and Hive metadata is persisted, so that you can spin up a new cluster and pick up the data and Hive state from any previous work.
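Because the data outlives the cluster, jobs reference it through HDFS-compatible URIs into external storage rather than local HDFS paths. A small sketch of how those Azure storage URIs are formed (the account, container, and path names below are hypothetical placeholders):

```python
# Sketch only: building the Azure storage URIs that HDFS-compatible clients use.
# The account, container, and path names here are hypothetical placeholders.

def wasb_uri(container: str, account: str, path: str) -> str:
    """WASB (Azure Blob storage) URI, e.g. for a Hive external table location."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

def adls_uri(account: str, path: str) -> str:
    """ADLS (Azure Data Lake Store) URI."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"

# The same logical dataset, addressable from any newly spun-up cluster:
print(wasb_uri("data", "mystore", "/warehouse/sales"))
# wasb://data@mystore.blob.core.windows.net/warehouse/sales
print(adls_uri("mystore", "/warehouse/sales"))
# adl://mystore.azuredatalakestore.net/warehouse/sales
```

Any new cluster pointed at the same URIs (plus the persisted Hive metastore) picks up exactly where the previous one left off.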
Hortonworks Data Cloud (HDC)
This was released by Hortonworks in Oct 2016. It is AWS-only, is provisioned via the AWS Marketplace (takes just minutes), and offers a set of preconfigured clusters focused on the core use cases of data prep, ETL, data warehousing, data science, and analytics. As such, it includes only the HDP services around these use cases (e.g. no Storm). It is meant for ephemeral workloads where you frequently spin down and then spin up, and thus pay as you go. Data is persisted in S3 and Hive metadata persists behind the scenes, so you can pick up state after spinning down/up. It is a very cost-effective and rapid way to do Big Data. Example use cases: imagine daily feeds from your on-prem cluster that get sent to HDC and processed for one hour to tune a model deployed in a production app; after processing, the cluster is turned off, and this happens daily. Or imagine a data science project where the cluster is spun down each day when no one is working on it. HDC is optimized for S3, so processing is much faster than it is for AWS EMR. HDC is self-managed (IaaS as preconfigured HDP clusters).
Regarding your questions:
1) Rationale for multiple cluster types is to preconfigure clusters so you simply select which one you want and quickly get up and running (vs designing/configuring them yourself).
2) The two Azure options are HDP IaaS on Azure via Cloudbreak (self-managed) and HDInsight (managed). The differences are explained in the summary above, but the main contrast is that HDInsight is more focused on Azure integration, managed services, an hourly pricing option, and ease of deployment, as well as both long-running and ephemeral workloads. As your link states, the HDInsight managed-service model means that support goes first to Microsoft (which escalates to Hortonworks if it is an HDP issue and not an Azure service-layer issue).
3) Difficult to benchmark directly -- you scale in the cloud to meet your performance needs.
4) HDP IaaS on Azure and HDInsight support both WASB and ADLS storage. Regarding performance, you scale to meet your needs.
5) HDP IaaS is deployed to the virtual cloud via the Cloudbreak tool, which, as mentioned, is very useful in provisioning the cluster (workflow, blueprints, autoscaling, etc.).
Created 12-21-2016 02:15 PM
@Greg Keys Thanks a lot. A few follow-up questions:
1. Option 2 that I was talking about is what I see in the Azure portal. Please see the attachments: hdponazure.png and hdponazure-clustercreation.png
2. What about the "Data Lake Store" as an option for storage on all options?
3. With respect to performance, my question was more around the issues due to compute and storage not being colocated.
4. And what is the purpose of HDCloud? Is it similar to Cloudbreak for AWS? Is it for HDP on AWS IaaS?
5. And HDC that you mentioned above -- is that an HDP-as-a-service offering from AWS?
Created 12-21-2016 03:43 PM
Thank you for your clarification on #2 -- I have updated my answer to reflect accordingly.
1) Yes, that is HDP IaaS on Azure. There are two ways to deploy this -- via the Hortonworks Cloudbreak tool and via the Azure Marketplace (your screenshots in the comment). https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_AzureSetup/content/ch_HDP_AzureSetup...
2) That is referred to as ADLS in my answer and is available to HDP IaaS on Azure and HDInsight. Data Lake Store is designed to be Azure's form of a data lake, compatible with HDFS and processed via Hadoop technology. https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
3) Compute and storage are not colocated, so yes, there is a cost to traveling across the wire. As the link above describes, ADLS is distributed, so it improves read/write performance through parallelization. Note that decoupling compute and storage in the cloud is typically seen as an advantage, since you can scale each separately (compute is expensive, storage is cheap; coupled compute/storage is expensive).
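To make the scale-each-separately point concrete, here is a back-of-the-envelope sketch. The rates are made-up placeholders, not real AWS or Azure pricing:

```python
# Hypothetical rates for illustration only -- not real AWS/Azure pricing.
COMPUTE_PER_NODE_HOUR = 0.50   # cluster compute (the expensive part)
STORAGE_PER_GB_MONTH = 0.03    # object storage such as S3/WASB/ADLS (the cheap part)

def monthly_cost(nodes: int, hours_per_day: float, data_gb: float, days: int = 30) -> float:
    """Total monthly cost when compute and storage are billed independently."""
    compute = nodes * hours_per_day * days * COMPUTE_PER_NODE_HOUR
    storage = data_gb * STORAGE_PER_GB_MONTH
    return compute + storage

# Same 10-node cluster and 5 TB of data; only the compute hours differ.
always_on = monthly_cost(nodes=10, hours_per_day=24, data_gb=5000)  # 3750.0
ephemeral = monthly_cost(nodes=10, hours_per_day=1, data_gb=5000)   # 300.0
print(always_on, ephemeral)
```

The storage line stays the same either way; it is the compute hours you can spin down that dominate the bill, which is the whole premise behind ephemeral clusters.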
4) HDCloud should be thought of much differently than HDP via Cloudbreak to AWS. HDCloud is deployed and billed via the AWS Marketplace (thus pay-as-you-go, i.e. by the minute) and is meant for ephemeral workloads -- ones you want to turn on/off to save cost -- whereas HDP via Cloudbreak should be seen as a full on-prem deployment, but in the cloud. HDCloud is very focused on the core use cases of data prep, ETL, data warehousing, data science, and analytics. It is a subset of full HDP services. HDCloud clusters are preconfigured and can be spun up in mere minutes (the IaaS and HDP layers are deployed in one shot via preconfigured clusters). HDP via Cloudbreak is a full cluster deployment (whatever config you want), with the IaaS deployment assisted and managed via Cloudbreak and the HDP deployment via Ambari. In a nutshell: HDP via Cloudbreak is full control for long-running, full-featured clusters, while HDCloud is pre-packaged, easy-to-deploy clusters to be used with the intent of spinning them up/down frequently to save money on a pay-by-the-minute basis.
5) HDC is not a managed service like HDInsight. Once you quickly get HDC up, you are completely inside the HDP world and not the service world. HDC is IaaS, but with quick, prepackaged clusters that are easy to deploy.
Created 12-21-2016 04:42 PM
@Greg Keys Thanks again. Hopefully the last set of questions:
1. With HDP in the Azure Marketplace, we cannot use the OS of our choice. With Cloudbreak, can we specify the OS?
2. Storage in Azure -- are HDFS, WASB, and ADLS options for all deployment options, i.e. HDP IaaS (Cloudbreak, Marketplace) and HDInsight?
3. With HDC, can I choose the OS?
4. What are the storage options for HDCloud? Are they HDFS and S3 (same as for HDP on AWS IaaS through Cloudbreak)?
5. Can I deploy HDP via Cloudbreak in an AWS VPC, similar to the way I can deploy in the AWS public cloud?
6. Can I deploy HDC in an AWS VPC?
7. What are my options to move data from on-premises to the AWS public cloud (S3, HDFS) and an AWS VPC (S3, HDFS)? (This may not be strictly an HDP question!)
8. What are my options to move data from on-premises to the Azure public cloud (WASB, ADLS, HDFS)?
9. Can I spin up HDInsight or HDP (Cloudbreak or Marketplace) in an Azure private cloud? (I assume that Azure offers two flavors of private cloud -- on-premises hosted, and one similar to a VPC.)
Created 12-21-2016 10:29 PM
1) No. The host instances that Cloudbreak creates in the cloud infrastructure use the following default base images:
- Amazon Web Services: RHEL 7.1
- Microsoft Azure: CentOS 7.1 OpenLogic
- Google Cloud Platform: CentOS 7 (centos-7-v20150603)
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_cldbrk_install/bk_CLBK_IAG/content/ch02s...
2) see *
3) No, you get Amazon Linux 2016.09 https://aws.amazon.com/amazon-linux-ami/
4) see *
5-6) See **
7-8) We strongly recommend NiFi, which has native connectors and is very easy/fast to develop, deploy, and operate: http://hortonworks.com/apache/nifi/
9) See **
* I recommend posting these as a separate question around storage options for HDP in the cloud.
** I recommend posting these as a separate question around VPC deployment of HDP in the cloud.
The reason I am suggesting this is that I am not the best person to answer these, and the questions will get better exposure and thus be of better benefit to you and the community.
Created 09-02-2022 04:52 AM
May I ask one question, please, @gkeys?
If I buy HDInsight as PaaS, what will the role and responsibilities of a Hadoop admin be, or will the admin job role be removed?
Since we can't upgrade Hadoop versions ourselves and services will be one-click ready, what else remains? Performance tuning can be done directly by developers. I hope you understand my concern...