Created on 10-17-2015 06:02 PM - edited 09-16-2022 02:44 AM
Some customers opt to set up HDP on Azure IaaS instead of HDInsight. They may still need to persist data in the Azure Blob Store. If you have set up HDP on Azure IaaS with WASB for storage, please share your setup instructions and related best practices so all can benefit.
Created 10-17-2015 08:33 PM
The main thing that needs to be done to enable access to WASB is to configure the credentials. This involves editing core-site.xml, and the instructions are documented in Apache here:
http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
After configuring the credentials, you'll be able to access WASB files by specifying URLs that use "wasb" as the scheme. HDInsight clusters also make WASB the default file system by setting configuration property fs.defaultFS to a wasb URL. The same Apache documentation shows usage examples for both cases.
Note that setting fs.defaultFS is only done if your intention is to replace HDFS fully as the default file system. To my knowledge, this has only ever been done in HDInsight.
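As a minimal sketch of the credential step, the Apache documentation linked above sets the storage account key in core-site.xml with a property of the following form. Here youraccount and YOUR_ACCESS_KEY are placeholders, not values from this thread.

```xml
<!-- core-site.xml: youraccount and YOUR_ACCESS_KEY are placeholders; substitute your own -->
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
```

With that in place, a path such as wasb://yourcontainer@youraccount.blob.core.windows.net/ (yourcontainer is also a placeholder) should be reachable from Hadoop clients that read this core-site.xml.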
Created 10-17-2015 08:30 PM
This blog covers installing HDP on Azure (IaaS) and configuring WASB (with screenshots).
Created 10-19-2015 01:40 PM
Here are some details on setting up WASB as the default filesystem (fs.defaultFS) using HDP 2.3.x on Azure IaaS:
Architecture Considerations
Before using WASB as the defaultFS, it is important to understand the architectural impact this will have on the cluster.
What is WASB? + Pros
Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs that interfaces with data stored within an Azure Blob Storage account. One of the key advantages of using WASB is that it creates a layer of abstraction that enables separation of storage from compute. You can also add/remove/modify files in the Azure blob store without regard to the Hadoop cluster, as directory and file references are checked at runtime for each job. This allows you to better utilise the flexibility of a cloud deployment, as data can persist without the need for a cluster / compute nodes. You have the following deployment options when using WASB:
In many ways you can interface with the WASB filesystem as you would with HDFS, by specifying the WASB URLs:
wasb://<containername>@<accountname>.blob.core.windows.net/<path>
For example, the following Hadoop FileSystem shell command against WASB behaves similarly to the same call made to HDFS:
hadoop fs -ls wasb://<containername>@<accountname>.blob.core.windows.net/
However, it is important to note that WASB and HDFS are separate filesystems and WASB currently has some limitations that should be considered before implementing.
Current Limitations - As at October 2015
The following limitations currently exist within the WASB filesystem:
When deciding to use WASB as the defaultFS, it is recommended that you first review and assess your security and data access requirements and ensure you can meet them against the above limitations.
Installation + Configuration Steps
Important: There is a known bug in HDP 2.3.0 that prevents non-HDFS filesystems (S3, WASB, etc.) from being set as the default filesystem [HADOOP-11618, HADOOP-12304]. For the following steps to work, you will need to use either a patched version of HDP 2.3 or HDP 2.3.2 or later.
To configure WASB as the default filesystem in place of HDFS, the following Azure information is required:
This information is used along with the Hadoop configurations within core-site.xml to configure WASB.
For a fresh install, follow the standard documentation for installing HDP via Ambari until you get to Customize Services, where you will need to configure your WASB properties. Also note that when using WASB as the defaultFS, you do not need to mount any additional data drives to the servers (as you would for an HDFS cluster).
Customize Services
The following is a list of configurations that should be modified to configure WASB:
wasb://<containername>@<accountname>.blob.core.windows.net
org.apache.hadoop.fs.azure.Wasb
<storage_access_key>
/mnt/resource/Hadoop/hdfs/data
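As a sketch only, assuming the first three values above map to the standard Hadoop property names for WASB, the core-site.xml entries would look roughly like this. CONTAINER, ACCOUNT and STORAGE_ACCESS_KEY are placeholders. The /mnt/resource/Hadoop/hdfs/data value is left out because its property name is not given here.

```xml
<!-- core-site.xml sketch: CONTAINER, ACCOUNT and STORAGE_ACCESS_KEY are placeholders -->

<!-- Make WASB the default filesystem in place of HDFS -->
<property>
  <name>fs.defaultFS</name>
  <value>wasb://CONTAINER@ACCOUNT.blob.core.windows.net</value>
</property>

<!-- AbstractFileSystem binding for the wasb scheme (assumed mapping for the class name above) -->
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

<!-- Access key for the storage account named in fs.defaultFS -->
<property>
  <name>fs.azure.account.key.ACCOUNT.blob.core.windows.net</name>
  <value>STORAGE_ACCESS_KEY</value>
</property>
```

In an Ambari-managed cluster these would be entered through the Customize Services screens rather than by editing the file by hand.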
Outside of these core-site.xml configurations, Hive has the following requirements when working with blob storage on Azure:
wasb:///hdp/apps/${hdp.version}/hive/hive.tar.gz
wasb:///hdp/apps/${hdp.version}/pig/pig.tar.gz
wasb:///hdp/apps/${hdp.version}/sqoop/sqoop.tar.gz
wasb:///hdp/apps/${hdp.version}/mapreduce/hadoop-streaming.jar
true
Further Reading
Created 01-30-2018 01:03 PM
Hello,
Could you please assist me in answering the following query of mine:
Regards,
Subhankar
Created 10-23-2015 05:33 PM
One thing to keep in mind is that SmartSense collects core-site.xml, so any WASB access keys in it will be collected as well. To opt out, you need to exclude the properties you don't want collected.
Created 10-26-2015 06:28 PM
Just went through it yesterday; steps here.
Created 12-09-2015 10:01 PM
If performance also matters, remember to put the Storage Account in the same region as the IaaS deployment. You also need to remember the Azure limits for IaaS: do not put more than 2 DataNodes on the same Storage Account (because of the maximum IOPS limits).
Also look at Azure Data Lake Store if you plan to use one storage layer for both IaaS and PaaS (HDInsight).
Created 02-03-2016 05:38 PM
How about using DASH? Cloudbreak suggests DASH with WASB.
Created 05-11-2016 09:33 AM
Hi, does this support Transparent Data Encryption? And how does disaster recovery, or copying data to Azure Blob storage in another region, work?