Need instructions to set up WASB as storage for HDP on Azure IaaS

Contributor

Some customers opt to set up HDP on Azure IaaS instead of HDInsight. They may still need to persist data in the Azure Blob Store. If you have set up HDP on Azure IaaS with WASB for storage, please share your setup instructions and related best practices so everyone can benefit.

1 ACCEPTED SOLUTION


The main thing that needs to be done to enable access to WASB is to configure the credentials. This involves editing core-site.xml, and the instructions are documented in Apache here:

http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html

After configuring the credentials, you'll be able to access WASB files by specifying URLs that use "wasb" as the scheme. HDInsight clusters also make WASB the default file system by setting configuration property fs.defaultFS to a wasb URL. The same Apache documentation shows usage examples for both cases.

Note that setting fs.defaultFS is only done if your intention is to replace HDFS fully as the default file system. To my knowledge, this has only ever been done in HDInsight.
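
As a minimal sketch of the credential step, the shell function below prints a core-site.xml property element for a storage account key, using the fs.azure.account.key.<accountname>.blob.core.windows.net property name from the Apache documentation linked above. The helper name wasb_key_property, the account name, and the key value are placeholders chosen for illustration, not values from this thread.

```shell
#!/bin/sh
# Print the core-site.xml <property> element that holds a storage account key.
# wasb_key_property is a hypothetical helper; account and key are placeholders.
wasb_key_property() {
  account="$1"   # storage account name, supplied by the caller
  key="$2"       # storage access key for that account
  printf '<property>\n'
  printf '  <name>fs.azure.account.key.%s.blob.core.windows.net</name>\n' "$account"
  printf '  <value>%s</value>\n' "$key"
  printf '</property>\n'
}

# Example call with placeholder values.
wasb_key_property "youraccount" "YOUR_ACCESS_KEY"
```

The printed element goes inside the <configuration> element of core-site.xml.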


10 REPLIES

Master Mentor

This blog covers installing HDP on Azure (IaaS) and configuring WASB (with screenshots).

Expert Contributor

Here are some details on setting up WASB as the default filesystem (fs.defaultFS) using HDP 2.3.x on Azure IaaS:

Architecture Considerations

Before using WASB as the defaultFS, it is important to understand the architectural impact this will have on the cluster.

What is WASB? + Pros

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs that interfaces with data stored within an Azure Blob Storage account. One of the key advantages of using WASB is that it creates a layer of abstraction that enables separation of storage from compute. You can also add/remove/modify files in the Azure blob store without regard to the Hadoop cluster, as directory and file references are checked at runtime for each job. This allows you to better utilise the flexibility of a cloud deployment, as data can persist without the need for a cluster / compute nodes. You have the following deployment options when using WASB:

  1. HDFS as defaultFS - Deploy a HDFS cluster and configure an external WASB filesystem to interface with a separate Blob Storage account. In this scenario your default Hadoop data would still be stored on HDFS, and you would have some data stored in a Blob Storage account that can be accessed when needed.
  2. WASB as defaultFS - Deploy the cluster with WASB as the default filesystem to completely replace HDFS with WASB. In this scenario HDFS still technically exists but will be empty as all files / folders deployed with the cluster are stored in WASB (blob storage). Note, the steps below describe how to configure a cluster for this approach.

In many ways you can interface with the WASB filesystem as you would with HDFS, by specifying the WASB URLs:

wasb://<containername>@<accountname>.blob.core.windows.net/<path>

For example, the following Hadoop FileSystem Shell command against WASB behaves similarly to the same call made against HDFS:

hadoop fs -ls wasb://<containername>@<accountname>.blob.core.windows.net/
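
A small sketch of how the URL pattern above can be assembled in a script before it is passed to the FileSystem Shell. The helper name wasb_url and the container, account, and path values are illustrative assumptions; the listing only runs when a hadoop command is available on the PATH.

```shell
#!/bin/sh
# Build a WASB URL from its parts. wasb_url is a hypothetical helper name.
wasb_url() {
  container="$1"
  account="$2"
  path="${3:-/}"   # default to the container root when no path is given
  printf 'wasb://%s@%s.blob.core.windows.net%s\n' "$container" "$account" "$path"
}

# Placeholder container, account, and path.
url="$(wasb_url "mycontainer" "youraccount" "/data")"
echo "$url"

# Only attempt the listing when the hadoop CLI is installed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls "$url"
fi
```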

However, it is important to note that WASB and HDFS are separate filesystems and WASB currently has some limitations that should be considered before implementing.

Current Limitations - As at October 2015

The following limitations currently exist within the WASB filesystem:

  • WebHDFS is not compatible with WASB
    • This limits the functionality within both Hue and Ambari Views (and any application that interfaces with the WebHDFS APIs)
  • Security permissions are persisted in WASB, but are not enforced.
    • File owner and group are persisted for all directories and files, but the permission model is not enforced at the filesystem level and all authorisation occurs at the level of the entire Azure Blob Storage account.

When deciding to use WASB as the defaultFS, it is recommended that you first review and assess your security and data access requirements and ensure you can meet them against the above limitations.

Installation + Configuration Steps

Important: There is a known bug in HDP 2.3.0 that prevents non-HDFS filesystems (S3, WASB, etc.) from being set as the default filesystem [HADOOP-11618, HADOOP-12304]. For the following steps to work, you will need to use either a patched version of HDP 2.3 or HDP 2.3.2 or later.

To configure WASB as the default filesystem in place of HDFS, the following Azure information is required:

  • Azure Storage Account URL
  • Container Name
  • Storage Access Key

This information is used along with the Hadoop configurations within core-site.xml to configure WASB.

For a fresh install, follow the standard documentation for installing HDP via Ambari found here until you get to Customize Services, where you will need to configure your WASB properties. Also note that when using WASB as the defaultFS, you do not need to mount any additional data drives to the servers (as you would for an HDFS cluster).

Customize Services

The following is a list of configurations that should be modified to configure WASB:

  • fs.defaultFS
    wasb://<containername>@<accountname>.blob.core.windows.net
  • fs.AbstractFileSystem.wasb.impl
    org.apache.hadoop.fs.azure.Wasb
  • fs.azure.account.key.<accountname>.blob.core.windows.net
    <storage_access_key>
  • DataNode Directories
    /mnt/resource/Hadoop/hdfs/data
    Even though WASB will be set as the fs.defaultFS, you still need to define DataNode directories for HDFS. As the intent here is to use WASB as the primary FS, you can set the HDFS DataNode directories to the temporary /mnt/resource mount point that is provided with Azure compute servers if you only plan to use HDFS for temporary job files.
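
As a rough sketch, the three core-site values listed above can be generated from the Azure details gathered earlier and printed as name=value lines for entry under Customize Services. The helper name wasb_defaultfs_settings and the sample container, account, and key are assumptions for illustration. The closing check uses hdfs getconf and only runs where the hdfs command is installed.

```shell
#!/bin/sh
# Print the core-site settings for WASB as the default filesystem.
# wasb_defaultfs_settings is a hypothetical helper; arguments are placeholders.
wasb_defaultfs_settings() {
  container="$1"
  account="$2"
  key="$3"
  host="${account}.blob.core.windows.net"
  printf 'fs.defaultFS=wasb://%s@%s\n' "$container" "$host"
  printf 'fs.AbstractFileSystem.wasb.impl=org.apache.hadoop.fs.azure.Wasb\n'
  printf 'fs.azure.account.key.%s=%s\n' "$host" "$key"
}

wasb_defaultfs_settings "mycontainer" "youraccount" "YOUR_ACCESS_KEY"

# After deployment, show the effective default filesystem if the CLI exists.
if command -v hdfs >/dev/null 2>&1; then
  hdfs getconf -confKey fs.defaultFS
fi
```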

Outside of these core-site.xml configurations, Hive has the following requirements when working with blob storage on Azure:

  • Point archive and jar files to 'wasb://' instead of 'hdfs://'
    templeton.hive.archive
    wasb:///hdp/apps/${hdp.version}/hive/hive.tar.gz
    templeton.pig.archive
    wasb:///hdp/apps/${hdp.version}/pig/pig.tar.gz
    templeton.sqoop.archive
    wasb:///hdp/apps/${hdp.version}/sqoop/sqoop.tar.gz
    templeton.streaming.jar
    wasb:///hdp/apps/${hdp.version}/mapreduce/hadoop-streaming.jar
  • Skip Azure metrics in custom webhcat-site
    fs.azure.skip.metrics
    true
    When WASB is used, metrics collection is enabled by default. For the WebHCat server, this causes unnecessary overhead that can be disabled.
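
The scheme change for the archive and jar paths can be sketched as a simple rewrite. This assumes the existing values start with hdfs:// and otherwise match the paths shown above; that starting point and the helper name to_wasb are assumptions for illustration.

```shell
#!/bin/sh
# Rewrite a leading hdfs:// scheme to wasb://, leaving the rest untouched.
# to_wasb is a hypothetical helper name.
to_wasb() {
  printf '%s\n' "$1" | sed 's|^hdfs://|wasb://|'
}

# Single quotes keep ${hdp.version} literal; it is not a shell variable.
for value in \
  'hdfs:///hdp/apps/${hdp.version}/hive/hive.tar.gz' \
  'hdfs:///hdp/apps/${hdp.version}/pig/pig.tar.gz' \
  'hdfs:///hdp/apps/${hdp.version}/sqoop/sqoop.tar.gz' \
  'hdfs:///hdp/apps/${hdp.version}/mapreduce/hadoop-streaming.jar'
do
  to_wasb "$value"
done
```

Values that already use wasb:// pass through unchanged, so the rewrite is safe to repeat.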

New Contributor

Hello,

Could you please assist me in answering the following query of mine:

https://community.hortonworks.com/questions/167906/we-are-unable-to-access-sparkspark2-when-we-chang...

Regards,

Subhankar

Master Mentor

One thing to keep in mind is that SmartSense collects core-site.xml, so any WASB access keys stored there will be collected as well. To opt out, you need to disable collection of the properties you don't want collected.

Master Mentor

Just went through it yesterday; steps here.

Rising Star

If performance matters, remember to put the Storage Account in the same region as the IaaS deployment. Also keep the Azure limits for IaaS in mind: do not put more than 2 DataNodes on the same Storage Account (because of its maximum IOPS limits).

Also look at Azure Data Lake Store if you plan to use one store for both IaaS and PaaS (HDInsight).


How about using DASH? Cloudbreak suggests DASH with WASB.

New Contributor

Hi, does this support Transparent Data Encryption? And how does disaster recovery, or copying to Azure Blob storage in another region, work?