
HDInsight Vs HDP Service on Azure Vs HDP on Azure IaaS

Expert Contributor

I see 3 different options to deploy HDP on Azure:

  1. HDInsight (built on top of HDP)
  2. HDP as a Service
  3. Deploying HDP on Azure's bare metal

In my understanding, 1 and 2 are managed services where control is limited when it comes to the choice of OS, etc. HDInsight has multiple cluster types (I'm not sure what the rationale behind this is, though).

Questions:

  1. What's the rationale behind having multiple cluster types for HDInsight?
  2. Why are two services (1 and 2 above) offered? When should each be used? (apart from this)
  3. Are there any performance benchmarks for HDInsight or HDP on Azure in a production situation?
  4. What are the different storage types possible on the above services? At least on HDInsight I see that Blob storage and Data Lake Store are options, but both are external to the compute nodes. That may hurt performance, hence question 3, apart from the fact that the clusters run on virtual machines.
  5. What are the options to provision HDP on Azure bare-metal nodes (option 3)? Does Cloudbreak help there?
1 ACCEPTED SOLUTION

Guru

Note that there are 3 cloud deploy options for HDP -- you are missing HDC (Hortonworks Data Cloud). Also, your #2 is probably best called HDP IaaS so that it is not confused with managed services.

HDP via Cloudbreak (alternatively via Azure Marketplace for Azure)

This is a full deployment of whatever cluster configuration you want, to AWS, Azure, Google, or OpenStack. This is pure IaaS. It is like having your full on-prem possibilities in the cloud. Hortonworks' Cloudbreak tool helps provision the IaaS instances and manage them, including autoscaling. Blueprints help you select preconfigured cluster types, but you can use whatever server instances and cluster configurations (number of master nodes and datanodes, how HDP services are distributed across them) you wish. It deploys all HDP services via Ambari, just as you would on-prem. You can use HDFS, or S3, Blob, etc. as storage, or a combination. It is meant for long-running ("permanent") clusters. For Azure, you can alternatively deploy your full cluster via the Azure Marketplace instead of using Cloudbreak.
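To make the blueprint idea concrete, here is a rough sketch (Python emitting JSON) of the general shape of an Ambari blueprint. The blueprint name, stack version, components, and cardinalities below are made-up examples for illustration, not a recommendation:

```python
import json

# Illustrative sketch of an Ambari blueprint of the kind Cloudbreak consumes.
# Component names follow Ambari conventions; the name, stack version, and
# cardinalities are hypothetical placeholders.
blueprint = {
    "Blueprints": {
        "blueprint_name": "hdp-small-cluster",  # hypothetical name
        "stack_name": "HDP",
        "stack_version": "2.5",
    },
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
        },
        {
            "name": "worker",
            "cardinality": "3",  # number of worker nodes; scale as you wish
            "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
        },
    ],
}

print(json.dumps(blueprint, indent=2))
```

The point is that the cluster layout (which services land on which host groups, and how many of each) is plain data you hand to the tooling, rather than something you assemble host by host.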

HDInsight

This is PaaS (managed services) on Azure. It includes most, but not all, HDP services (e.g. it has HBase, Storm, and Spark, but currently does not have Atlas) and also has R Server on Spark; these are managed as Azure services engineered by Microsoft and Hortonworks to conform to the Azure platform. It has a set of preconfigured cluster types to make spinning up a cluster easy and fast. It can use HDFS, ADLS, or WASB storage. It is meant for both long-running and ephemeral clusters (spun down when work is completed). For ephemeral clusters, data is persisted in ADLS or WASB and Hive metadata is persisted, so that you can spin up a new cluster and pick up the data and Hive state from any previous work.
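To make the storage choices concrete, here is a small illustrative helper that labels a path by its URI scheme, the way Hadoop tooling distinguishes these backends. The account and container names in the examples are made up:

```python
from urllib.parse import urlparse

# Maps Hadoop filesystem URI schemes to the storage backends discussed above.
SCHEMES = {
    "hdfs": "HDFS (local to the cluster nodes)",
    "wasb": "WASB (Azure Blob storage)",
    "wasbs": "WASB (Azure Blob storage over TLS)",
    "adl": "ADLS (Azure Data Lake Store)",
}

def storage_backend(uri: str) -> str:
    """Return a human-readable label for the storage backend of a path."""
    scheme = urlparse(uri).scheme
    return SCHEMES.get(scheme, "unknown scheme: " + scheme)

# Hypothetical account/container names, for illustration only.
print(storage_backend("wasb://data@myaccount.blob.core.windows.net/raw/events"))
print(storage_backend("adl://myaccount.azuredatalakestore.net/clickstream"))
print(storage_backend("hdfs://headnode:8020/tmp/staging"))
```

Because jobs address storage by URI like this, the same cluster can read from HDFS and WASB/ADLS side by side, which is what makes the spin-down/spin-up pattern workable.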

Hortonworks Data Cloud (HDC)

This was released by Hortonworks in Oct 2016. It is AWS-only, provisioned via the AWS Marketplace (takes just minutes), and offers a set of preconfigured clusters focused on the core use cases of data prep, ETL, data warehousing, data science, and analytics. As such, it includes only the HDP services around these use cases (e.g. no Storm). It is meant for ephemeral workloads where you frequently spin down and then spin up, and thus pay as you go. Data is persisted in S3, and Hive metadata persists behind the scenes, so you can pick up state after spinning down and up. It is a very cost-effective and rapid way to do Big Data. Example use cases: imagine daily feeds from your on-prem cluster that are sent to HDC and processed for one hour to tune a model deployed in a production app; after processing, the cluster is turned off, and this happens daily. Or imagine a data science project where the cluster is spun down each day when no one is working on it. HDC is optimized for S3, so processing is much faster than with AWS EMR. HDC is self-managed (IaaS as preconfigured HDP clusters).
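A quick back-of-the-envelope sketch of why the spin-up/spin-down pattern in the daily-feed example is cost-effective. The node count and hourly rate below are invented placeholders, not actual AWS pricing:

```python
# Compare the "process one hour per day, then spin down" pattern with a
# cluster left running 24/7. All numbers are hypothetical placeholders.
nodes = 10
rate_per_node_hour = 0.50  # USD per node-hour, invented for illustration

def monthly_compute_cost(hours_per_day: float, days: int = 30) -> float:
    """Compute cost for a month at a given daily utilization."""
    return nodes * rate_per_node_hour * hours_per_day * days

ephemeral = monthly_compute_cost(hours_per_day=1)   # spin up, process, spin down
always_on = monthly_compute_cost(hours_per_day=24)  # long-running cluster

print(f"ephemeral: ${ephemeral:.2f}/month, always-on: ${always_on:.2f}/month")
print(f"savings factor: {always_on / ephemeral:.0f}x")
```

With these placeholder numbers the ephemeral pattern is 24x cheaper, which is the whole economic argument for pay-by-the-minute preconfigured clusters.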

Regarding your questions:

1) The rationale for multiple cluster types is to preconfigure clusters so you simply select the one you want and quickly get up and running (vs. designing and configuring them yourself).

2) The two Azure options are HDP IaaS on Azure via Cloudbreak (self-managed) and HDInsight (managed). The differences are explained in the summary above, but the main contrast is that HDInsight is more focused on Azure integration, managed services, an hourly pricing option, and ease of deployment, as well as both long-running and ephemeral workloads. As your link states, the HDInsight managed-service model means that support goes first to Microsoft (which escalates to Hortonworks if it is an HDP issue and not an Azure service-layer issue).

3) Difficult to benchmark directly -- you scale in the cloud to meet your performance needs.

4) HDP IaaS on Azure and HDInsight support both WASB (Blob) and ADLS storage. Regarding performance, you scale to meet your needs.

5) HDP IaaS is deployed to the virtual cloud via the Cloudbreak tool, which as mentioned is very useful for provisioning the cluster (workflow, blueprints, autoscaling, etc.).




Expert Contributor

@Greg Keys Thanks a lot. A few follow-up questions:

1. Option 2 that I was talking about is what I see in the Azure portal. Please see the attachments: hdponazure.png and hdponazure-clustercreation.png

2. What about Data Lake Store as a storage option on all the options?

3. With respect to performance, my question was more about the issues due to compute and storage not being colocated.

4. And what is the purpose of HDCloud? Is it similar to Cloudbreak for AWS? Is it for HDP on AWS IaaS?

5. And the HDC that you mentioned above -- is that an HDP-as-a-service offering on AWS?

Guru

@learninghuman

Thank you for your clarification on #2 -- I have updated my answer accordingly.

1) Yes, that is HDP IaaS on Azure. There are two ways to deploy this -- via Hortonworks' Cloudbreak tool and via the Azure Marketplace (your screenshots in the comment): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_AzureSetup/content/ch_HDP_AzureSetup...

2) That is referred to as ADLS in my answer and is available to HDP IaaS on Azure and HDInsight. Data Lake Store is designed to be Azure's form of a data lake, compatible with HDFS and processed via Hadoop technology. https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview

3) Compute and storage are not colocated, so yes, there is a cost to traveling across the wire. As the link above describes, ADLS is distributed, so it improves read/write performance through parallelization. Note that decoupling compute and storage in the cloud is typically seen as an advantage, since you can scale each separately (compute = expensive, storage = cheap, coupled compute/storage = expensive).
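A toy model of that scaling advantage. All rates below are invented placeholders, not actual Azure pricing; the point is only the shape of the cost curve:

```python
# Toy cost model for decoupled compute and storage: storage grows
# independently of compute, so 10x more data need not mean 10x more nodes.
# Both rates are hypothetical placeholders for illustration.
STORAGE_PER_TB_MONTH = 20.0   # USD per TB-month, invented
COMPUTE_PER_NODE_HOUR = 0.50  # USD per node-hour, invented

def monthly_cost(storage_tb: float, compute_nodes: int, compute_hours: float) -> float:
    """Monthly spend when storage and compute are billed separately."""
    return (storage_tb * STORAGE_PER_TB_MONTH
            + compute_nodes * COMPUTE_PER_NODE_HOUR * compute_hours)

small = monthly_cost(storage_tb=10, compute_nodes=8, compute_hours=100)
# 10x the data, same compute footprint -- only the cheap term grows:
bigger_data = monthly_cost(storage_tb=100, compute_nodes=8, compute_hours=100)

print(f"10 TB: ${small:.2f}/month, 100 TB: ${bigger_data:.2f}/month")
```

In a coupled (colocated) cluster, growing storage 10x would typically also mean buying roughly 10x the nodes, which is the expensive term in this model.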

4) HDCloud should be thought of much differently than HDP via Cloudbreak on AWS. HDCloud is deployed and billed via the AWS Marketplace (thus pay-as-you-go, i.e. by the minute) and is meant for ephemeral workloads -- ones you want to turn on and off to save cost -- whereas HDP via Cloudbreak should be seen as a full on-prem-style deployment, but in the cloud. HDCloud is very focused on the core use cases of data prep, ETL, data warehousing, data science, and analytics. It is a subset of the full HDP services. HDCloud clusters are preconfigured and can be spun up in mere minutes (the IaaS and HDP layers are deployed in one shot via preconfigured clusters). HDP via Cloudbreak is a full cluster deployment (whatever configuration you want), with the IaaS deployment assisted and managed via Cloudbreak and the HDP deployment via Ambari. In a nutshell: HDP via Cloudbreak is full control for long-running, full-featured clusters, while HDCloud is pre-packaged, easy-to-deploy clusters intended to be spun up and down frequently to save money on a pay-by-the-minute basis.

5) HDC is not a managed service like HDInsight. Once you quickly get HDC up, you are completely inside the HDP world, not the managed-service world. HDC is IaaS, just quick, prepackaged clusters that are easy to deploy.

Expert Contributor

@Greg Keys Thanks again. Hopefully the last set of questions:

1. With HDP in the Azure Marketplace, we cannot use the OS of our choice. With Cloudbreak, can we specify the OS?

2. Storage in Azure: are HDFS, WASB, and ADLS options for all deployment options -- HDP IaaS (Cloudbreak, Marketplace) and HDInsight?

3. With HDC, can I choose the OS?

4. What are the storage options for HDCloud? Are they HDFS and S3 (the same as for HDP on AWS IaaS through Cloudbreak)?

5. Can I deploy HDP via Cloudbreak in an AWS VPC, similar to the way I can deploy in the AWS public cloud?

6. Can I deploy HDC in an AWS VPC?

7. What are my options to move data from on-premise to the AWS public cloud (S3, HDFS) and an AWS VPC (S3, HDFS)? (This may not be strictly an HDP question!)

8. What are my options to move data from on-premise to the Azure public cloud (WASB, ADLS, HDFS)?

9. Can I spin up HDInsight or HDP (Cloudbreak or Marketplace) in an Azure private cloud? (I assume that Azure offers two flavors of private cloud -- one hosted on-premise and one similar to a VPC.)

Guru

1) No, the host instances that Cloudbreak creates in the cloud infrastructure use the following default base images:

  1. Amazon Web Services: RHEL 7.1
  2. Microsoft Azure: CentOS 7.1 OpenLogic
  3. Google Cloud Platform: CentOS 7 (centos-7-v20150603)

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_cldbrk_install/bk_CLBK_IAG/content/ch02s...

2) see *

3) No, you get Amazon Linux 2016.09: https://aws.amazon.com/amazon-linux-ami/

4) see *

5-6) **

7-8) We strongly recommend NiFi, which has native connectors and is very easy and fast to develop, deploy, and operate: http://hortonworks.com/apache/nifi/

9) See **

* I recommend posting these as a separate question about storage options for HDP in the cloud.

** I recommend posting these as a separate question about VPC deployment of HDP in the cloud.

The reason I suggest this is that I am not the best person to answer them, and the questions will get better exposure and thus be of more benefit to you and the community.

Explorer

May I ask one question, please, @gkeys?

If I buy HDInsight as PaaS, what will be the role and responsibilities of a Hadoop admin, or will the admin job role go away?

Since we can't upgrade Hadoop versions ourselves and the services are one-click ready, what else remains? Performance tuning can be done by developers directly. I hope you understand my concern...