Using HDInsight as a production Hadoop cluster?

Contributor

Curious to know what folks think about using HDInsight as a production cluster.

Requirements:

- daily ETL from an MSSQL server at a customer site into HDInsight (a rough sketch follows this list)

- pulling data into Hive

- transformations on the data -- so far possible through HiveQL and UDFs

- connection to Tableau
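
For concreteness, here is a minimal sketch of what one daily run of that pipeline could look like, assuming pyodbc for the MSSQL extract and beeline for the Hive load. Every hostname, credential, table, and path below is a hypothetical placeholder, not a prescription:

    import csv
    import subprocess

    import pyodbc

    # Hypothetical connection string for the customer-site MSSQL server.
    MSSQL_DSN = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=customer-sql.example.com;DATABASE=sales;UID=etl;PWD=secret"
    )
    EXPORT_FILE = "/tmp/orders_today.csv"

    # 1. Extract today's rows from MSSQL into a local CSV file.
    with pyodbc.connect(MSSQL_DSN) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT id, amount, created_at FROM orders "
            "WHERE created_at >= CAST(GETDATE() AS DATE)"
        )
        with open(EXPORT_FILE, "w", newline="") as f:
            writer = csv.writer(f)
            for row in cursor:
                writer.writerow(row)

    # 2. Load the extract into a Hive staging table via beeline
    #    (the JDBC URL and table name are placeholders for your cluster).
    subprocess.run(
        ["beeline",
         "-u", "jdbc:hive2://headnode.example.com:10001/default",
         "-e", f"LOAD DATA LOCAL INPATH '{EXPORT_FILE}' "
               "INTO TABLE staging_orders"],
        check=True,
    )

From there, the HiveQL/UDF transformations run inside the cluster, and Tableau connects through the cluster's Hive ODBC endpoint.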

1 ACCEPTED SOLUTION

Contributor

Some additional things to consider:

  1. Cost of transporting data: Azure bills for outbound network usage.
    1. This is not an issue if, for example, the MSSQL server you are ingesting data from is also in Azure.
  2. If data is going to live in the cluster for a long time, e.g. several weeks, then your best bang for the buck is going to be hosting it in your own datacenter on bare metal.
    1. Obviously, an important argument in favor of HDInsight is the ease of managing the cluster; a lack of in-house skill and capacity to host a cluster in your own DC would also rule out that option.
    2. Why is that? Because it goes against the grain of a basic tenet of Hadoop: "take processing to data instead of taking data to processing". HDInsight does not store data locally; it is stored in Azure Blob Storage, so all data must be brought to the processing (from Azure cloud storage to the compute nodes of the cluster). The sketch after this list illustrates that layout.
    3. This matters most if you are doing I/O-heavy processing, e.g. running data-intensive MR loads such as Hive queries against a DFS backed by Azure Blob Storage. By comparison, a Spark load may suffer less because its main bottleneck is compute, not data transport.
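
To make point 2.2 concrete: on HDInsight the default filesystem is backed by Blob Storage, so a table's data sits behind a wasb:// URI rather than on worker-node disks, and every full scan pulls those bytes over the network to the compute nodes. A minimal illustration using the PyHive client; the host, port, storage account, and container are hypothetical, and note that HDInsight's HiveServer2 actually fronts HTTP transport, so real connection settings will differ:

    from pyhive import hive  # assumes the PyHive client is installed

    conn = hive.connect(host="headnode.example.com", port=10000)
    cursor = conn.cursor()

    # The table's data lives in Azure Blob Storage, not on the workers:
    # scanning it moves the bytes from storage to the compute nodes.
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
            id BIGINT, amount DOUBLE, created_at STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 'wasb://data@mystorageacct.blob.core.windows.net/orders/'
    """)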

In general, HDInsight might be best suited for a targeted workload where you fire up a temporary cluster, do your analysis, and then take it down. For completeness, I should mention that HDInsight does have a small local DFS, but that is only used to store temporary files created during MR runs.
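
That fire-up/tear-down pattern can be scripted end to end. A rough sketch driving the Azure CLI from Python; the resource names are invented, and the exact az hdinsight flags are assumptions to verify against az hdinsight create --help:

    import subprocess

    def az(*args):
        # Thin wrapper around the Azure CLI; raises if a command fails.
        subprocess.run(["az", *args], check=True)

    # Flag spellings below are illustrative, not authoritative.
    az("hdinsight", "create",
       "--name", "adhoc-hive", "--resource-group", "analytics-rg",
       "--type", "hadoop", "--workernode-count", "4")

    # ... submit the Hive/MR workload against the new cluster here ...

    # Tear the cluster down; the data itself survives in Blob Storage.
    az("hdinsight", "delete",
       "--name", "adhoc-hive", "--resource-group", "analytics-rg", "--yes")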

4 REPLIES

@rmian There are a lot of good reasons for using HDI as a production cluster. These include:

  • ease of setup
  • easy scaling (see the sketch below)
  • cost effectiveness
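
As a sketch of the scaling point: resizing an HDI worker tier is a single CLI call, shown here from Python. The cluster and group names are hypothetical, and the flag spelling should be checked against az hdinsight resize --help:

    import subprocess

    # Grow the worker tier from its current size to 8 nodes.
    subprocess.run(
        ["az", "hdinsight", "resize",
         "--name", "prod-hdi", "--resource-group", "analytics-rg",
         "--workernode-count", "8"],
        check=True,
    )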

Keep in mind that HDI is a subset of the complete Hortonworks distribution, so you won't get all the bells and whistles, but you will get many of the key components you need.

All the items you listed are possible in HDI.

Expert Contributor

I wanted to ask about a couple of things here.

1) Suppose I have a few MapReduce jobs that need to run on HDI. As I understand the HDI approach, it is build, run, and delete. If I have placed all my jars, Oozie jobs, and configurations on the cluster and I delete it today, then when I want to run the same batch job in the future, do I need to copy all the jars back and reconfigure the Oozie jobs? (See the sketch after these questions.)

2) Is it possible to configure Solr to run on HDInsight?
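
To make question 1 concrete: if the jars and Oozie definitions are kept in Blob Storage, which outlives any one cluster, redeploying onto a fresh cluster can be scripted rather than redone by hand. A rough, hypothetical sketch (paths, hosts, and the Oozie endpoint are all assumptions):

    import subprocess

    def sh(*args):
        # Run a cluster-side command; raise if it fails.
        subprocess.run(args, check=True)

    # The artifacts were kept in Blob Storage, so they survived deletion
    # of the previous cluster; copy them onto the new default filesystem.
    sh("hdfs", "dfs", "-cp",
       "wasb://artifacts@mystorageacct.blob.core.windows.net/batch-app/",
       "/apps/batch-app/")

    # Re-submit the Oozie workflow; job.properties points at those jars.
    sh("oozie", "job",
       "-oozie", "http://headnode.example.com:11000/oozie",
       "-config", "job.properties", "-run")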

Explorer

Once you're on Azure you also have access to Azure Data Factory (ADF) and Azure Data Lake Store (ADLS). ADF is a workflow-orchestration tool that works with both on-premises and cloud data pipelines. ADLS is an HDFS-compatible cloud storage service that enables data sharing, via Azure Active Directory authentication, across multiple Hadoop clusters and other HDFS-compliant compute services.
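
As an illustration of the ADLS side: a cluster addresses Data Lake Store through adl:// URIs much as it addresses Blob Storage through wasb://, so several clusters can share one dataset by pointing external tables at the same location. The account, path, and connection details below are hypothetical:

    from pyhive import hive  # assumes the PyHive client is installed

    conn = hive.connect(host="headnode.example.com", port=10000)
    cursor = conn.cursor()

    # Any cluster granted rights to this ADLS account (via Azure Active
    # Directory) can declare the same external table over the shared path.
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS shared_events (
            event_id BIGINT, payload STRING
        )
        STORED AS ORC
        LOCATION 'adl://mydatalake.azuredatalakestore.net/shared/events/'
    """)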