Created on 01-07-2016 07:49 PM - edited 09-16-2022 02:56 AM
Curious to know what folks think about using HDInsight as a production cluster.
Requirements:
- daily ETL from an MSSQL server at the customer site into HDInsight
- pulling data into Hive
- transformations on data -- so far possible through HiveQL and UDFs
- connection to Tableau
Created 01-08-2016 05:58 PM
Some additional things to consider:
In general, HDInsight might be best suited for a targeted workload where you fire up a temporary cluster, do your analysis, and then take it down. For completeness, I should mention that HDInsight does have a tiny local DFS, but it is only used to store temporary files created during MR runs.
Created 01-08-2016 01:04 PM
@rmian There are a lot of good reasons for using HDI as a production cluster. These include
Keep in mind that HDI is a subset of the complete Hortonworks distribution, so you won't get all the bells and whistles, but you'll get many of the key components you'll need.
All the items you listed are possible in HDI.
Created 03-02-2016 01:32 PM
I wanted to know a couple of things here.
1) Suppose I have a few MapReduce jobs that need to run on HDI. As I understand it, the HDI approach is build, run, and delete. If I place all my jars, Oozie jobs, and configurations on the cluster and then delete the cluster today, will I need to copy all the jars and reconfigure the Oozie jobs when I want to run the same batch job in the future?
2) Is it possible to configure Solr to run on HDInsight?
Created 01-11-2016 09:02 PM
Once you're on Azure, you have access to Azure Data Factory (ADF) and Azure Data Lake Store (ADLS) as well. ADF is a workflow orchestration tool that works with on-premises and cloud data pipelines. ADLS is an HDFS-in-the-cloud storage solution that enables data sharing, via Azure Active Directory authentication, across multiple Hadoop clusters and other HDFS-compliant computes.