
HDP on Cloud (Azure, AWS) - Storage options and data movement into them

Expert Contributor

What storage options are possible when deploying HDP in the cloud?

My understanding is as follows:

1. Azure (HDInsight, HDP via Cloudbreak, HDP in the Marketplace)

WASB - What about parallelism here? That is, if I store a file here and run a MapReduce job processing that file, would I achieve the same effect as I would with HDFS storage?

ADLS - Although not co-located, its performance can be improved by means of parallelism.

HDFS itself - I can move the data to the edge node and then copy it into HDFS.

What are my options for moving my data into WASB or ADLS? This thread suggests NiFi, but my requirement is ephemeral and a NiFi investment may not sell.

2. AWS (the questions below apply to HDCloud and HDP via Cloudbreak on AWS)

S3 - What about parallelism here? That is, if I store a file here and run a MapReduce job processing that file, would I achieve the same effect as I would with HDFS storage?

HDFS itself - I can move the data to the edge node and then copy it into HDFS.

And out of these storage options, which one is better than the others, and for what reason?


9 REPLIES


@learninghuman

As you pointed out, object stores are inherently not co-located. What Microsoft and Amazon do is attack this at the software layer by overriding certain Java classes in core Hadoop. There is a great discussion of this in an active Hadoop Common JIRA titled "Impersonate hosts in s3a for better data locality handling": HADOOP-12878:

The proposal in that JIRA involves a config setting, fs.s3a.block.location.impersonatedhost, where the user can enter the list of hostnames in the cluster to return from getFileBlockLocations. What WASB does differently from S3A right now is that it overrides getFileBlockLocations to mimic the concept of block size, using that block size to divide a file and report that it has multiple block locations. For something like MapReduce, that translates to multiple input splits, more map tasks, and a greater opportunity for I/O parallelism on jobs that consume a small number of very large files. S3A is different in that it inherits the getFileBlockLocations implementation from the superclass, which always reports that the file has exactly one block location (localhost). That could mean, for example, that S3A would experience a bottleneck on a job whose input is a single very large file, because it would get only one input split. Use of the same host name in every block location can also cause scheduling bottlenecks at the ResourceManager.
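For illustration only, here is a minimal sketch (not the actual WASB or S3A source) of how a Hadoop FileSystem connector can synthesize multiple block locations for an object-store file, which is the technique being discussed. The block size and host list are assumptions that a real connector would read from configuration (e.g., the proposed fs.s3a.block.location.impersonatedhost setting):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

// Hypothetical helper, not actual connector source: splits an object-store
// file into synthetic "blocks" and spreads them across a supplied host list,
// so a scheduler sees multiple input splits instead of one.
public class SyntheticBlockLocations {

    public static BlockLocation[] forFile(FileStatus file, long start, long len,
                                          long blockSize, String[] hosts) {
        if (file == null || start < 0 || len <= 0 || file.getLen() == 0) {
            return new BlockLocation[0];
        }
        long end = Math.min(start + len, file.getLen());
        List<BlockLocation> locations = new ArrayList<>();
        int i = 0;
        for (long offset = start; offset < end; offset += blockSize, i++) {
            long length = Math.min(blockSize, end - offset);
            // Rotate through the impersonated hosts so splits spread across the cluster.
            String host = hosts[i % hosts.length];
            locations.add(new BlockLocation(new String[]{host + ":50010"},
                                            new String[]{host}, offset, length));
        }
        return locations.toArray(new BlockLocation[0]);
    }
}
```

With synthetic locations like these, a MapReduce job over one very large file gets one input split per block rather than a single split for the whole file.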

So, to answer your question -- "out of these storage options, which one is better than the others, and for what reason?" -- the answer right now would be WASB, because of the problem mentioned above. However, it is important to note that even WASB is exposed to this same problem if the same host name is returned in every block location. Finally, you can see that this JIRA is about making this override part of core Hadoop, so that S3A, WASB, and any other file system could call it to get the benefits.

Note: If you are not interested in using Apache NiFi for moving data into these cloud storage options, both WASB and S3 have their own, proprietary ways of moving data in. If the data is moving from HDFS as a source, both can also be targets for DistCp.
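As a rough sketch of that path, the snippet below copies a file from HDFS into an object-store location through the Hadoop FileSystem API; in practice you would more commonly just run hadoop distcp from the command line. The paths, bucket name, and credential values are placeholders, and credentials normally belong in core-site.xml or a credential provider rather than in code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Sketch: copy one file from HDFS to S3A (or a wasb:// path) using the
// FileSystem API. All names and keys below are placeholders.
public class CopyToCloudStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");   // placeholder
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");   // placeholder

        Path src = new Path("hdfs:///data/input/events.csv");
        Path dst = new Path("s3a://my-bucket/input/events.csv");
        // For Azure: new Path("wasb://container@account.blob.core.windows.net/input/events.csv")

        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);

        // deleteSource = false, overwrite = true
        FileUtil.copy(srcFs, src, dstFs, dst, false, true, conf);
    }
}
```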

Beyond the improvements to core Hadoop above, perhaps the best way to achieve performance with cloud storage today is to use Hive LLAP. LLAP provides a hybrid execution model which consists of a long-lived daemon replacing direct interactions with the HDFS DataNode and a tightly integrated DAG-based framework. Functionality such as caching, pre-fetching, some query processing, and access control are moved into the daemon. Small / short queries are largely processed by this daemon directly, while any heavy lifting is performed in standard YARN containers.

Similar to the DataNode, LLAP daemons can be used by other applications as well, especially if a relational view on the data is preferred over file-centric processing. The daemon is also open through optional APIs (e.g., InputFormat) that can be leveraged by other data processing frameworks as a building block (such as Apache Spark). Last, but not least, fine-grained column-level access control - a key requirement for mainstream adoption of Hive and Spark - fits nicely into this model. See the recent Hortonworks blog for more information: "SparkSQL, Ranger, and LLAP via Spark Thrift Server for BI Scenarios to provide Row and Column-level ...".
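To make the point about other applications using the LLAP daemons concrete, here is an illustrative JDBC client querying HiveServer2 Interactive (the LLAP-backed endpoint). The host, user, and table are placeholders, and 10500 is only a common default port for HiveServer2 Interactive in HDP, so check your cluster's configuration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative only: run a query through HiveServer2 Interactive (LLAP).
// Host, port, user, and table name are placeholders.
public class LlapJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://llap-host.example.com:10500/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs")) {
            while (rs.next()) {
                System.out.println("row count = " + rs.getLong(1));
            }
        }
    }
}
```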

The diagram below shows an example execution with LLAP. The Tez AM orchestrates overall execution. The initial stage of the query is pushed into LLAP. In the reduce stage, large shuffles are performed in separate containers. Multiple queries and applications can access LLAP concurrently.

10799-llap-diagram.png


@stevel for additional comments / corrections to what I've stated here.

Expert Contributor

@Tom McCuch Thanks. Can you also please talk a little bit about ADLS? Do you still recommend WASB over ADLS?

And I am not clear on the parallelism factor for S3 and WASB. Are you saying that S3 does not offer parallelism and is suitable for a larger number of smaller files? What's your take on parallelism when it comes to WASB?

And can I use WASB, ADLS, or S3 as the HDFS layer when I install HDP on Azure IaaS (using Cloudbreak)?

@learninghuman

Microsoft recently announced the general availability of ADLS, which is their exabyte-scale data storage and management offering in Azure. Hortonworks recently delivered the work to certify Azure HDInsight (HDI) 3.5, based on HDP 2.5, against ADLS. This means customers can choose between using WASB or ADLS as their storage underneath HDI. Both scenarios can be fully supported by Microsoft. However, ADLS is not currently supported by Hortonworks as a storage option for HDP deployed on Azure IaaS. Only VHDs and WASB are currently supported by Hortonworks as storage options for HDP deployed on Azure IaaS today.

Hortonworks is also at the center of the Hadoop performance work being done on AWS and S3. We have done some of the work to offer parallelism for S3, but today this is only offered through Hortonworks Data Cloud for AWS (HDC), as it is not part of Core Hadoop 2 (which HDP is currently based on). Hortonworks has backported some of the performance work done in Core Hadoop 3 around S3 to the HDC offering.

Full support for ADLS, as well as S3, is planned in Core Hadoop 3. Referring you back to my earlier post, you can see that as part of HADOOP-12878, the community is striving to offer consistent parallelism across both cloud storage options (and potentially others) through some planned extensibility within HDFS itself. HDP will move to Core Hadoop 3 only after it is deemed stable by the Apache Software Foundation, likely within the next year or so. Until then, Cloudbreak (which deploys HDP across different cloud providers, and is separate from both HDI and HDC) will support VHDs and WASB for deployment of HDP on Azure IaaS, and attached storage (ephemeral or EBS) for deployment of HDP on AWS.


@learninghuman

If these answers are helpful, please don't forget to Accept the top one for me!

Thanks and Happy New Year!

_Tom

Expert Contributor

@Tom McCuch So to summarize, please correct as appropriate:

1. HDI 3.5 - WASB and ADLS

2. Pre HDI 3.5 - Only WASB

3. HDP on Azure IaaS - Only WASB and HDFS on VHD

4. HDP from Azure Marketplace - Only WASB and HDFS on VHD

5. HDCloud 2.5 - S3 Only

6. HDP on AWS IaaS - HDFS on Ephemeral or EBS


@learninghuman

Yes, this is correct as of today. The next major release of HDP (3) will provide support for ADLS and S3 - so if you get started now with either HDI 3.5 or HDC 2.5, you aren't locking yourself into those PaaS offerings long-term. Cloudbreak / HDP will continue to offer you cloud portability.

Expert Contributor

@Tom McCuch One last question that came up after reading your answer again. WASB in Azure is supported both for HDP on Azure IaaS and for HDP in the Azure Marketplace. Does this mean that WASB is natively optimized in Hadoop 2.x? If so, would this also mean that any distribution with Hadoop 2.x deployed on Azure can use WASB for storage?


@learninghuman

You can read more in "Hadoop Azure Support: Azure Blob Storage" in the Apache docs for Hadoop 2.7.2. You'd need to check with the vendors behind the other distros to see whether or not they support this.
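As a small illustration of that point, stock Apache Hadoop 2.7.x can read and write WASB with just the hadoop-azure module on the classpath and a storage account key in the configuration. The account, container, and key below are placeholders; the key would normally live in core-site.xml or a credential provider rather than in code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list a directory in Azure Blob Storage through the WASB connector
// shipped with Apache Hadoop 2.7.x (hadoop-azure). All names are placeholders.
public class WasbListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.azure.account.key.myaccount.blob.core.windows.net",
                 "YOUR_STORAGE_ACCOUNT_KEY");   // placeholder

        Path dir = new Path("wasb://mycontainer@myaccount.blob.core.windows.net/data/");
        FileSystem fs = dir.getFileSystem(conf);

        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}
```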