Member since: 10-08-2015
Posts: 87
Kudos Received: 143
Solutions: 23
My Accepted Solutions

Title | Views | Posted |
---|---|---|
 | 1031 | 03-02-2017 03:39 PM |
 | 4696 | 02-09-2017 06:43 PM |
 | 14251 | 02-04-2017 05:38 PM |
 | 4522 | 01-10-2017 10:24 PM |
 | 3561 | 01-05-2017 06:58 PM |
01-02-2017
08:48 PM
1 Kudo
@Vivek Sharma Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a virtual private cloud (VPC), an isolated area within AWS where you configure your own virtual network and control aspects such as private IP address ranges, subnets, route tables, and network gateways. HDCloud requires a VPC, so it always runs inside this kind of isolated network within AWS. From the Network and Security section of the current Hortonworks Data Cloud documentation:

In addition to the Amazon EC2 instances created for the cloud controller and cluster nodes, Hortonworks Data Cloud deploys the following network and security AWS resources on your behalf:

- An Amazon VPC configured with a public subnet: when deploying the cloud controller, you have two options: (1) specify an existing VPC, or (2) have the cloud controller create a new VPC. Each cluster is launched into a separate subnet. For more information, see the Security documentation.
- An Internet gateway and a route table (as part of the VPC infrastructure): the Internet gateway enables outbound access to the Internet from the control plane and the clusters, and the route table connects the subnet to the Internet gateway. For more information on Amazon VPC architecture, see the AWS documentation.
- Security groups: control the inbound and outbound traffic to and from the control plane instance. For more information, see the Security documentation.
- IAM instance roles: hold the permissions to create certain resources. For more information, see the Security documentation.

If using your own VPC, make sure that:

- The subnet specified when creating a controller or cluster exists within the specified VPC.
- Your VPC has an Internet gateway attached.
- Your VPC has a route table attached.
- The route table includes a rule that routes all traffic (0.0.0.0/0) to the Internet gateway. This sends all subnet traffic that isn't between instances within the VPC out to the Internet over the Internet gateway.

Because the subnets used by HDC must be associated with a route table that has a route to an Internet gateway, they are referred to as public subnets. For that reason, the system is configured by default to restrict inbound network traffic to a minimal set of ports. The following security groups are created automatically:

- CloudbreakSecurityGroup is created when launching your cloud controller and is associated with the cloud controller instance. By default, this group enables HTTP (80) and HTTPS (443) access to the Cloud UI, plus SSH access from the remote locations specified in the "Remote Access" CloudFormation parameter.
- ClusterNodeSecurityGroupmaster is created when you create a cluster and is associated with all master node(s). By default, this group enables SSH access from the remote locations specified in the "Remote Access" parameter when creating the cluster.
- ClusterNodeSecurityGroupworker is created when you create a cluster and is associated with all worker node(s). By default, this group enables SSH access from the remote locations specified in the "Remote Access" parameter when creating the cluster.

See the Ports section of the Security documentation for information about additional ports that may be opened on these groups.
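If you want to double-check an existing VPC against the requirements listed above before pointing the cloud controller at it, here is a minimal sketch using the AWS SDK for Python (boto3). It is not part of the HDC tooling; the region, VPC ID, and subnet ID are placeholders you would replace with your own.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

vpc_id = "vpc-xxxxxxxx"        # placeholder: the VPC you plan to reuse
subnet_id = "subnet-xxxxxxxx"  # placeholder: the subnet for the controller/cluster

# 1. The subnet must exist within the specified VPC.
subnet = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]
assert subnet["VpcId"] == vpc_id, "Subnet does not belong to the specified VPC"

# 2. The VPC must have an Internet gateway attached.
igws = ec2.describe_internet_gateways(
    Filters=[{"Name": "attachment.vpc-id", "Values": [vpc_id]}]
)["InternetGateways"]
assert igws, "No Internet gateway is attached to the VPC"
igw_id = igws[0]["InternetGatewayId"]

# 3. A route table attached to the VPC must route 0.0.0.0/0 to that Internet gateway.
route_tables = ec2.describe_route_tables(
    Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
)["RouteTables"]
has_default_route = any(
    route.get("DestinationCidrBlock") == "0.0.0.0/0" and route.get("GatewayId") == igw_id
    for rt in route_tables
    for route in rt["Routes"]
)
assert has_default_route, "No route table routes 0.0.0.0/0 to the Internet gateway"

print("VPC looks suitable for a Hortonworks Data Cloud deployment")
```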
01-02-2017
06:04 PM
4 Kudos
@Vivek Sharma Yes, you can use CloudFormation to deploy HDP on AWS IaaS. In fact, we use CloudFormation as well as other AWS services within Hortonworks Data Cloud for AWS (HDC) today:

- Amazon EC2 is used to launch virtual machines.
- Amazon VPC is used to provision your own dedicated virtual network and launch resources into that network.
- AWS Identity & Access Management is used to control access to AWS services and resources.
- AWS CloudFormation is used to create and manage a collection of related AWS resources.
- AWS Lambda is a utility service for running code in AWS. It is used when deploying the cloud controller into a new VPC to validate that the specified VPC and subnet exist and that the subnet belongs to that VPC.
- Amazon S3 provides secure, durable, and highly scalable cloud storage.
- Amazon RDS provides a relational database in AWS. It is used for managing reusable, shared Hive Metastores and as a configuration option when launching the cloud controller.

With a formal Hortonworks Subscription in force, Hortonworks will support any HDP cluster that was provisioned through Ambari, regardless of how that provisioning process was scripted. If you use the Hortonworks Data Cloud Controller and HDP Services sold through the AWS Marketplace, then Hortonworks provides and supports the CloudFormation scripts as well. Save yourself some time, and check out HDC first!
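As a rough illustration of driving CloudFormation programmatically, here is a minimal boto3 sketch that launches a stack from a template. The region, stack name, template URL, and parameter names are placeholders, not the actual HDC CloudFormation template or its parameters.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # assumed region

response = cfn.create_stack(
    StackName="my-hdp-stack",  # placeholder
    TemplateURL="https://s3.amazonaws.com/my-bucket/hdp-template.json",  # placeholder
    Parameters=[
        {"ParameterKey": "KeyName", "ParameterValue": "my-keypair"},      # placeholder
        {"ParameterKey": "RemoteAccess", "ParameterValue": "0.0.0.0/0"},  # placeholder
    ],
    Capabilities=["CAPABILITY_IAM"],  # required when the template creates IAM resources
)
print("Stack creation started:", response["StackId"])

# Block until CloudFormation reports the stack is fully created.
cfn.get_waiter("stack_create_complete").wait(StackName="my-hdp-stack")
```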
01-02-2017
04:15 PM
2 Kudos
@learninghuman To state it simply, auto-scaling is a capability of Cloudbreak only at this point in time. With Cloudbreak Periscope, you can define a scaling policy and apply it to any alert on any Ambari metric. Scaling granularity is at the Ambari host group level, which gives you the option to scale only specific services or components rather than the whole cluster (see the illustrative sketch at the end of this post). Per your line of questioning above, if you use Cloudbreak to provision HDP on either Azure IaaS or AWS IaaS, you can use the auto-scaling capabilities it provides. Both Azure HDInsight (HDI) and Hortonworks Data Cloud for AWS (HDC) make it very easy to manually resize your cluster through their respective consoles, but auto-scaling is not a feature of either offering at this point in time. In regards to data re-balancing, neither HDI nor HDC needs to be concerned with this, because both are automatically configured to use cloud storage (currently ADLS and S3, respectively) rather than HDFS. For HDP deployed on IaaS with Cloudbreak, auto-scaling may perform an HDFS rebalance, but only after a downscale operation. To keep HDFS healthy during downscale, Cloudbreak always keeps the configured replication factor and makes sure there is enough space on HDFS to rebalance data. During downscale, in order to minimize rebalancing and replication storms in HDFS, Cloudbreak checks block locations and computes the least costly operations.
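To make the moving pieces concrete, here is a purely illustrative Python sketch of the kind of configuration involved: an alert bound to an Ambari metric and a scaling policy that adjusts one host group when the alert fires. The names, metric, and fields are hypothetical and do not reflect Periscope's actual API; in practice these definitions are created through the Cloudbreak UI or its API.

```python
# Hypothetical alert definition bound to an Ambari metric.
pending_containers_alert = {
    "name": "pending-yarn-containers",                   # hypothetical alert name
    "ambari_metric": "yarn.QueueMetrics.PendingContainers",  # hypothetical metric name
    "threshold": 50,        # fire when more than 50 containers are waiting
    "period_minutes": 5,
}

# Hypothetical scaling policy: granularity is a single Ambari host group.
upscale_policy = {
    "alert": pending_containers_alert["name"],
    "host_group": "worker",   # only the 'worker' host group scales, not the whole cluster
    "adjustment": +3,         # add three nodes to that host group when the alert fires
    "cooldown_minutes": 30,
}

def desired_adjustment(current_pending_containers):
    """Return the node-count change the policy would request for the metric value."""
    if current_pending_containers > pending_containers_alert["threshold"]:
        return upscale_policy["adjustment"]
    return 0

print(desired_adjustment(120))  # -> 3: the 'worker' host group would grow by three nodes
```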
12-30-2016
02:25 PM
@learninghuman You can read more in the "Hadoop Azure Support: Azure Blob Storage" page of the Apache Hadoop 2.7.2 documentation. You'd need to check with the vendors behind the other distros to see whether or not they support this.
12-28-2016
03:22 PM
1 Kudo
@learninghuman Yes, this is correct as of today. The next major release of HDP (3) will provide support for ADLS and S3 - so if you get started now with either HDI 3.5 or HDC 2.5, you aren't locking yourself into those PaaS offerings long-term. Cloudbreak / HDP will continue to offer you cloud portability.
12-27-2016
03:56 PM
1 Kudo
@learninghuman If these answers are helpful, please don't forget to Accept the top one for me! Thanks and Happy New Year! _Tom
12-27-2016
03:53 PM
2 Kudos
@learninghuman Microsoft recently announced the general availability of ADLS, their exabyte-scale data storage and management offering in Azure. Hortonworks recently delivered the work to certify Azure HDInsight (HDI) 3.5, based on HDP 2.5, against ADLS. This means customers can choose between WASB and ADLS as the storage underneath HDI, and both scenarios can be fully supported by Microsoft. However, ADLS is not currently supported by Hortonworks as a storage option for HDP deployed on Azure IaaS; only VHDs and WASB are currently supported by Hortonworks for HDP on Azure IaaS today.

Hortonworks is also at the center of the Hadoop performance work being done on AWS and S3. We have done some of the work to offer parallelism for S3, but today this is only offered through Hortonworks Data Cloud for AWS (HDC), as it is not part of Core Hadoop 2 (which HDP is currently based on). Hortonworks has backported some of the S3 performance work done in Core Hadoop 3 to the HDC offering. Full support for ADLS, as well as S3, is planned for Core Hadoop 3. Referring you back to my earlier post, you can see that as part of HADOOP-12878 the community is striving to offer consistent parallelism across both cloud storage options (and potentially others) through some planned extensibility within HDFS itself.

HDP will move to Core Hadoop 3 only after it is deemed stable by the Apache Software Foundation, likely within the next year or so. Until then, Cloudbreak (which deploys HDP across different cloud providers, and is separate from both HDI and HDC) will support VHDs and WASB for deployment of HDP on Azure IaaS, and attached storage (ephemeral or EBS) for deployment of HDP on AWS.
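To show what that storage choice looks like from a job's perspective, here is a minimal PySpark sketch reading data over WASB and over ADLS. It assumes a cluster (such as HDI 3.5) where the Azure connectors and storage credentials are already configured; the account, container, and path names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloud-storage-demo").getOrCreate()

# WASB: data stored in an Azure Blob Storage container (placeholder names).
wasb_df = spark.read.csv(
    "wasb://mycontainer@myaccount.blob.core.windows.net/data/events.csv"
)

# ADLS: the same kind of read against Azure Data Lake Store (placeholder names).
adls_df = spark.read.csv(
    "adl://myadlsaccount.azuredatalakestore.net/data/events.csv"
)

# The only difference visible to the job is the URI scheme.
print(wasb_df.count(), adls_df.count())
```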
12-25-2016
06:57 PM
1 Kudo
Tagging @stevel for additional comments / corrections to what I've stated here.
12-25-2016
06:56 PM
2 Kudos
@learninghuman As you pointed out, object stores are inherently not co-located. What Microsoft and Amazon do is attack this at the software layer by overriding certain Java classes in core Hadoop. There is a great discussion of this in an active Hadoop Common JIRA titled "Impersonate hosts in s3a for better data locality handling" (HADOOP-12878). Azure's implementation involves a config setting, fs.azure.block.location.impersonatedhost, where the user can enter the list of hostnames in the cluster to return from getFileBlockLocations; the JIRA proposes the equivalent fs.s3a.block.location.impersonatedhost setting for S3A.

What WASB does differently from S3A right now is that it overrides getFileBlockLocations to mimic the concept of a block size, uses that block size to divide a file, and reports that the file has multiple block locations. For something like MapReduce, that translates to multiple input splits, more map tasks, and a greater opportunity for I/O parallelism on jobs that consume a small number of very large files. S3A, by contrast, inherits the getFileBlockLocations implementation from its superclass, which always reports that the file has exactly one block location (localhost). That could mean, for example, that S3A experiences a bottleneck on a job whose input is a single very large file, because it would get only one input split; use of the same host name in every block location can also cause scheduling bottlenecks at the ResourceManager. (The small numeric sketch at the end of this post puts rough numbers on the parallelism difference.)

So, to answer your question "out of these storage options, which one is better over the other and for what reason?": the answer right now would be WASB, because of the problem described above. However, it is important to note that even WASB is exposed to the same scheduling problem if the same host name is returned in every block location. Finally, you can see that this JIRA is about making this override part of core Hadoop, so that S3A, WASB, and any other file system could call it and get the benefits.

Note: if you're not interested in using Apache NiFi for moving data into these cloud storage options, both WASB and S3A have their own ways of moving data in, and if the data is coming from HDFS, both can be targets for DistCp.

Beyond the improvements to core Hadoop above, perhaps the best way to achieve performance with cloud storage today is to use Hive LLAP. LLAP provides a hybrid execution model consisting of a long-lived daemon, which replaces direct interactions with the HDFS DataNode, and a tightly integrated DAG-based framework. Functionality such as caching, pre-fetching, some query processing, and access control is moved into the daemon. Small, short queries are largely processed by the daemon directly, while any heavy lifting is performed in standard YARN containers. Similar to the DataNode, LLAP daemons can be used by other applications as well, especially if a relational view on the data is preferred over file-centric processing. The daemon is also open through optional APIs (e.g., InputFormat) that can be leveraged by other data processing frameworks as a building block (such as Apache Spark). Last but not least, fine-grained column-level access control (a key requirement for mainstream adoption of Hive and Spark) fits nicely into this model. See the recent Hortonworks blog post "SparkSQL, Ranger, and LLAP via Spark Thrift Server for BI Scenarios to provide Row and Column-level Security, and Masking" for more information. In an example LLAP execution, the Tez AM orchestrates overall execution: the initial stage of the query is pushed into LLAP, large shuffles in the reduce stage are performed in separate containers, and multiple queries and applications can access LLAP concurrently.
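To put rough numbers on the parallelism point above, here is a small illustrative Python calculation (not Hadoop code) of how the reported block size translates into MapReduce input splits for a single large file; the 1 TB file size and 256 MB block size are assumptions for illustration.

```python
def input_splits(file_size_bytes, reported_block_size_bytes):
    """Roughly one MapReduce input split per reported block (ceiling division)."""
    return max(1, -(-file_size_bytes // reported_block_size_bytes))

one_tb = 1024 ** 4

# WASB-style behaviour: mimics a 256 MB block size and reports many block locations.
wasb_like = input_splits(one_tb, 256 * 1024 ** 2)

# Current S3A behaviour: the whole file is reported as a single block location.
s3a_like = input_splits(one_tb, one_tb)

print(wasb_like)  # 4096 splits -> up to 4096 map tasks can read in parallel
print(s3a_like)   # 1 split    -> a single map task reads the entire file
```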
12-22-2016
11:55 PM
@Warren Tracey Did this answer your question? If so, please accept the answer. If not, I'll be happy to answer any other questions you have. Thanks! _Tom