Member since: 10-08-2015
Posts: 87
Kudos Received: 142
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 473 | 03-02-2017 03:39 PM
 | 2345 | 02-09-2017 06:43 PM
 | 11110 | 02-04-2017 05:38 PM
 | 1683 | 01-10-2017 10:24 PM
 | 1931 | 01-05-2017 06:58 PM
03-30-2018
04:19 PM
@Ajay - Thank you for this article! Can you please re-label this to Apache Hadoop HDFS Ozone, rather than Apache Ozone? The latter is not the proper use of Apache branding. Thanks. Tom
03-29-2017
06:01 PM
@Anandakrishnan Ramakrishnan --delete-target-dir is meant to delete the <HDFS-target-dir> provided in the command before writing data to that directory. If it isn't a permissions issue, I suspect it is failing because that target isn't an HDFS directory.
03-26-2017
04:38 PM
2 Kudos
@Francisco Pires Typically, for data warehousing, we recommend logically organizing your data into tiers for processing. The physical organization is a little different for everyone, but here is an example for Hive:
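One common three-tier layout, purely as an illustration (the database names raw, staged, and curated and the sample web_logs table are placeholders, not a prescribed naming scheme), expressed through spark.sql in a Hive-enabled Spark 2.x shell:

```scala
// Three illustrative tiers: land data as-is, standardize it, then
// publish conformed tables for consumption. All names are placeholders.
spark.sql("CREATE DATABASE IF NOT EXISTS raw")     // as-ingested, immutable copies of source data
spark.sql("CREATE DATABASE IF NOT EXISTS staged")  // cleansed, typed, partitioned
spark.sql("CREATE DATABASE IF NOT EXISTS curated") // conformed, consumption-ready marts

// Land raw files where they arrive, without transformation.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS raw.web_logs (line STRING)
  LOCATION '/data/raw/web_logs'
""")

// Promote into the staged tier in an optimized columnar format.
spark.sql("""
  CREATE TABLE IF NOT EXISTS staged.web_logs (line STRING)
  STORED AS ORC
""")
spark.sql("INSERT OVERWRITE TABLE staged.web_logs SELECT line FROM raw.web_logs")
```

The same idea applies whether the promotion step runs in Hive, Pig, or Spark.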
03-02-2017
03:39 PM
1 Kudo
@eorgad To protect the S3A access/secret keys, it is recommended that you use either IAM role-based authentication (such as an EC2 instance profile) or the Hadoop Credential Provider Framework, storing the keys securely and accessing them through configuration. The Credential Provider Framework keeps secrets outside of Hadoop configuration files, storing them in encrypted files on local or Hadoop filesystems, and supplying them to the requests that need them. The Hadoop-AWS Module documentation describes how to configure this properly.
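As a rough sketch of the credential-provider route from Spark (the JCEKS path, bucket, and session setup below are illustrative, and the keystore must already contain fs.s3a.access.key and fs.s3a.secret.key entries, e.g. created with the hadoop credential create CLI):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-credential-provider-sketch")
  .getOrCreate()

// Point the S3A client at an encrypted keystore instead of keeping
// plaintext keys in configuration files. The path is a placeholder and
// must already hold fs.s3a.access.key / fs.s3a.secret.key entries.
spark.sparkContext.hadoopConfiguration.set(
  "hadoop.security.credential.provider.path",
  "jceks://hdfs@namenode:8020/user/tom/s3.jceks")

// Any S3A access now resolves the keys from the keystore rather than
// from core-site.xml or command-line properties.
val lines = spark.sparkContext.textFile("s3a://my-bucket/data/sample.txt")
println(lines.count())
```

With IAM role-based authentication (e.g. an EC2 instance profile), none of this is needed; the S3A client typically picks up the instance credentials automatically.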
02-09-2017
08:23 PM
1 Kudo
@Peter Coates As always, it is good to hear from you. In lieu of answering all of your questions outright (since several of them deal with Amazon Proprietary IP) ... If I helped progress your research here, I'd be very appreciative if you could Accept my answer. Thanks. Tom
02-09-2017
06:43 PM
1 Kudo
@Peter Coates I think the AWS doc is pretty clear on this: "EBS-optimized instances deliver dedicated bandwidth to Amazon EBS". How they do that is unclear, but it is safe to say that creating an overlay network (e.g. VPC, EBS-Optimized, etc.) would require the use of SDN technology. That said, there is a pretty descriptive chart of expected bandwidth, throughput, and IOPS for Amazon EBS-Optimized Instances here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html. Regarding your observations on S3 performance, you are paying a lot less for S3, so most of this is to be expected. However, the link I shared with you previously covers the work we are driving in the community around Hadoop-AWS integration. That work deals with how Hadoop needs to be optimized to interact with blob storage APIs (such as S3A) going forward, as opposed to how it historically interacted with direct-attached storage.
02-09-2017
12:38 AM
1 Kudo
@Peter Coates Because of the nature of the S3 object store, data written to an S3A OutputStream is not written incrementally; instead, by default, it is buffered to disk until the stream is closed in its close() method. This can make output slow: the further the process is from the S3 endpoint, or the smaller the EC2-hosted VM is, the longer the work will take to complete. Work to address this began in Hadoop 2.7 with the S3AFastOutputStream (HADOOP-11183) and has continued with S3ABlockOutputStream (HADOOP-13560). With incremental writes of blocks, "S3A fast upload" offers an upload time at least as fast as the "classic" mechanism, with significant benefits on long-lived output streams and when very large amounts of data are generated. Please see "Hadoop-AWS module: Integration with Amazon Web Services" for more information on Hadoop-AWS integration. On instances without support for EBS-optimized throughput, network traffic can contend with traffic between your instance and your EBS volumes. EBS-optimized instances deliver dedicated bandwidth to Amazon EBS, with options between 500 Mbps and 10,000 Mbps, depending on the instance type you use. When you enable EBS optimization for an instance that is not EBS-optimized by default, you pay an additional low, hourly fee for the dedicated capacity.
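To make the fast-upload part concrete, here is a rough spark-shell-era sketch (the bucket and buffer choice are placeholders; fs.s3a.fast.upload applies to the Hadoop 2.7/2.8 line discussed above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-fast-upload-sketch")
  .getOrCreate()

val hc = spark.sparkContext.hadoopConfiguration

// Switch S3A from "buffer the whole object locally until close()" to
// incremental block uploads (HADOOP-11183 / HADOOP-13560).
hc.set("fs.s3a.fast.upload", "true")
// On Hadoop 2.8+ you can also pick the buffering mechanism:
// "disk" (default), "array", or "bytebuffer".
hc.set("fs.s3a.fast.upload.buffer", "disk")

// Writes to S3A now upload blocks as they fill rather than only at close().
spark.range(1000000L).write.mode("overwrite").csv("s3a://my-bucket/fast-upload-test/")
```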
02-08-2017
02:00 PM
1 Kudo
@BigDataRocks Please let me know if this helped answer your question. Thanks. Tom
02-04-2017
05:38 PM
2 Kudos
@BigDataRocks I believe you need to escape the wildcard: val df = spark.sparkContext.textFile("s3n://..../\*.gz"). Additionally, the S3N filesystem client, while widely used, is no longer under active maintenance except for emergency security issues. The S3A filesystem client can read all files created by S3N, so it should be used wherever possible. Please see https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md for the S3A classpath dependencies and authentication properties you need to be aware of. A nice tutorial on this subject can be found here: https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
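As a hedged sketch of the S3A route in a Spark 2.x shell (the bucket, prefix, and keys are placeholders; with IAM instance profiles or a credential provider the two key settings can be dropped):

```scala
// Assumes a Spark 2.x spark-shell where `spark` is already defined.
val hc = spark.sparkContext.hadoopConfiguration

// Placeholder credentials for illustration only.
hc.set("fs.s3a.access.key", "MY_ACCESS_KEY")
hc.set("fs.s3a.secret.key", "MY_SECRET_KEY")

// textFile accepts glob patterns, so *.gz picks up every gzipped part
// file under the (placeholder) prefix and decompresses transparently.
val logs = spark.sparkContext.textFile("s3a://my-bucket/landing/2017-02-04/*.gz")
println(logs.count())
```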
01-30-2017
09:22 PM
1 Kudo
@shashi cheppela Let me know how things are going. If I've helped answer your question, I'd very much appreciate if you would recognize my effort by Accepting the answer. Thanks. Tom
01-29-2017
01:34 PM
1 Kudo
@shashi cheppela I see you are using --delete-target-dir. This attempts to delete the import target directory if it exists, and permissions on your S3 bucket may be tripping you up, so I'd start by removing that option. You may also want to double-check the S3A classpath dependencies and authentication properties you are using. See: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md and also: http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html
01-28-2017
02:48 PM
1 Kudo
@shashi cheppela Are you using --hive-import? If so, you may be running into this: SQOOP-3403. The workaround is to avoid --hive-import when --target-dir is an s3:// location.
01-11-2017
06:43 PM
1 Kudo
@jhals99 Likewise for the great question! If you could accept my answer, I'd be very appreciative. Thanks. Tom
01-10-2017
10:24 PM
2 Kudos
@jhals99

"I'm looking to provision HDP clusters in an AWS VPC, on custom hardened AMIs. Understand from the link below that Cloudbreak w/custom AMI requires a support subscription. Which level of support exactly?"
Cloudbreak in HDP is supported within the Enterprise Plus Subscription. Please see Support Subscriptions on hortonworks.com for more information.

"I have data on S3 encrypted with server-side encryption. S3A should provide seamless access to encrypted data: yes/no?"
Yes. The relevant properties are fs.s3a.access.key and fs.s3a.secret.key in core-site.xml; when set through Spark configuration they carry the spark.hadoop. prefix: spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY and spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY.

"I heard about an Amazon EMR customer using S3 (not local storage/HDFS) for an HBase environment. I assume S3A will provide the same functionality with HBase in HDP? No special sauce in EMR, right?"
S3 is not recommended for HBase today. Amazon EMR supports HBase with a combination of HFiles on S3 and the WAL on ephemeral HDFS; this configuration can lose data in the face of failures. Our HBase team is aware of this, and there are a couple of other options we are exploring, such as HFiles on S3 with the WAL on EBS/EFS, but this is still under investigation.

"EC2 local disks have to be LUKS encrypted. That should not matter to Cloudbreak/HDP, right?"
Correct. Hadoop uses the operating system to read disk; disk-level encryption sits below the operating system and is therefore not a concern for Cloudbreak/HDP.

"We require S3A role-based authentication, with access to S3 files from Hive, Pig, MR, Spark, etc. Does this work out of the box with the latest release of HDP?"
Yes. Please refer to the HCC article "How to Access Data Files stored in AWS S3 Buckets using HDFS / Hive / Pig" for more information.
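To make the server-side-encryption piece concrete, a small spark-shell sketch (bucket and path are placeholders, and AES256 here means SSE-S3; reading SSE-S3 data typically needs no extra client setting, while the property below ensures anything written back through S3A is also encrypted):

```scala
// Assumes a Spark 2.x spark-shell (`spark` predefined) and SSE-S3
// server-side encryption on the (placeholder) bucket.
val hc = spark.sparkContext.hadoopConfiguration

// Applies to objects written through S3A; SSE-S3 reads are transparent.
hc.set("fs.s3a.server-side-encryption-algorithm", "AES256")

val encrypted = spark.sparkContext.textFile("s3a://my-encrypted-bucket/data/part-*")
println(encrypted.take(3).mkString("\n"))
```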
01-07-2017
05:20 PM
1 Kudo
@learninghuman I hope this answered your questions. If so, please remember to accept my answer. Thank you! _Tom
01-06-2017
10:49 PM
1 Kudo
@learninghuman
To help clarify, all of the data access components within HDP run on YARN. We view Mesos as one of many alternatives for IaaS within the private cloud space (OpenStack, VMware, etc.). Our aim is to support them all and provide our customers both connectivity and portability across them with HDF and HDP.
01-05-2017
06:58 PM
2 Kudos
@learninghuman Cloudbreak currently offers a Technical Preview for deployment on Mesos. At a high level, Cloudbreak deployment on Mesos is similar to Cloudbreak implementations on other cloud providers: HDP clusters are provisioned through Ambari with the help of blueprints, and the Ambari server and agents run in Docker containers. However, there are some important differences from other cloud providers:
- Cloudbreak expects a "bring your own Mesos" infrastructure, which means that you have to deploy Mesos first and then configure access to the existing Mesos deployment in Cloudbreak.
- The Cloudbreak Mesos integration was designed not to first build the infrastructure as it does for other cloud environments, e.g. creating or reusing the networking layer (virtual networks, subnets, and so on), provisioning new virtual machines in these networks from pre-existing cloud images, and starting Docker containers on these VMs (nodes).

From a YARN perspective, there is no difference between Mesos and other public / private clouds: YARN manages the compute capacity provided to it by the IaaS layer.
01-05-2017
02:52 PM
1 Kudo
@Mahen Jay Perhaps the best way to test real-time issues from a student labs perspective is to dig into Ambari and HDP Upgrade. This is covered (with labs) in our HDP Operations: Hadoop Administration 2 course. See the HWU Training Catalog for more details.
01-03-2017
05:49 PM
@learninghuman If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-03-2017
05:48 PM
@Vivek Sharma If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-03-2017
05:47 PM
@Vivek Sharma If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-03-2017
05:46 PM
@Vivek Sharma If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-02-2017
09:46 PM
2 Kudos
@Vivek Sharma With Cloudbreak Periscope, you can define a scaling policy and apply it to any Alert on any Ambari Metric. Scaling granularity is at the Ambari host group level. This feature, which we refer to as auto-scaling, is only a capability of Cloudbreak at this point in time. Per your line of questioning above, if you use Cloudbreak to provision HDP on either Azure IaaS or AWS IaaS, you can use the auto-scaling capabilities it provides. Both Azure HDInsight (HDI) and Hortonworks Data Cloud for AWS (HDC) make it very easy to manually re-size your cluster through their respective consoles. However, the auto-scaling feature described above is not available with either HDI or HDC at this point in time.
01-02-2017
08:48 PM
1 Kudo
@Vivek Sharma Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a virtual private cloud (VPC), an isolated area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables, and network gateways. HDCloud requires a VPC, and is therefore limited to the AWS private cloud. From the Network and Security section of the current Hortonworks Data Cloud documentation:

In addition to the Amazon EC2 instances created for the cloud controller and cluster nodes, Hortonworks Data Cloud deploys the following network and security AWS resources on your behalf:
- An Amazon VPC configured with a public subnet: when deploying the cloud controller, you have two options: (1) you can specify an existing VPC, or (2) have the cloud controller create a new VPC. Each cluster is launched into a separate subnet. For more information, see the Security documentation.
- An Internet gateway and a route table (as part of VPC infrastructure): an Internet gateway is used to enable outbound access to the Internet from the control plane and the clusters, and a route table is used to connect the subnet to the Internet gateway. For more information on Amazon VPC architecture, see the AWS documentation.
- Security groups: to control the inbound and outbound traffic to and from the control plane instance. For more information, see the Security documentation.
- IAM instance roles: to hold the permissions to create certain resources. For more information, see the Security documentation.

If using your own VPC, make sure that:
- The subnet specified when creating a controller or cluster exists within the specified VPC.
- Your VPC has an Internet gateway attached.
- Your VPC has a route table attached.
- The route table includes a rule that routes all traffic (0.0.0.0/0) to the Internet gateway. This routes all subnet traffic that isn't between the instances within the VPC to the Internet over the Internet gateway.

Since the subnets used by HDC must be associated with a route table that has a route to an Internet gateway, they are referred to as public subnets. Because of this, the system is configured by default to restrict inbound network traffic to a minimal set of ports. The following security groups are created automatically:
- The CloudbreakSecurityGroup security group is created when launching your cloud controller and is associated with your cloud controller instance. By default, this group enables HTTP (80) and HTTPS (443) access to the Cloud UI and SSH access from the remote locations specified as the "Remote Access" CloudFormation parameter.
- The ClusterNodeSecurityGroupmaster security group is created when you create a cluster and is associated with all Master node(s). By default, this group enables SSH access from the remote locations specified as the "Remote Access" parameter when creating the cluster.
- The ClusterNodeSecurityGroupworker security group is created when you create a cluster and is associated with all Worker node(s). By default, this group enables SSH access from the remote locations specified as the "Remote Access" parameter when creating the cluster.

See the Ports section of the Security documentation for information about additional ports that may be opened on these groups.
01-02-2017
06:04 PM
4 Kudos
@Vivek Sharma Yes, you can use CloudFormation to deploy HDP in AWS IaaS. In fact, we use CloudFormation as well as other AWS services within Hortonworks Data Cloud for AWS (HDC) today:
- Amazon EC2 is used to launch virtual machines.
- Amazon VPC is used to provision your own dedicated virtual network and launch resources into that network.
- AWS Identity & Access Management is used to control access to AWS services and resources.
- AWS CloudFormation is used to create and manage a collection of related AWS resources.
- AWS Lambda is a utility service for running code in AWS. This service is used when deploying the cloud controller into a new VPC to validate if the VPC and subnet specified exist and if the subnet belongs to that VPC.
- Amazon S3 provides secure, durable, and highly scalable cloud storage.
- Amazon RDS provides a relational database in AWS. This service is used for managing reusable, shared Hive Metastores and as a configuration option when launching the cloud controller.

With a formal Hortonworks Subscription in force, Hortonworks will support any HDP cluster that was provisioned through Ambari, regardless of how that provisioning process was scripted. If using our Hortonworks Data Cloud Controller and HDP Services sold through the Amazon Marketplace, then Hortonworks provides and supports the CloudFormation scripts as well. Save yourself some time, and check out HDC first!
01-02-2017
04:15 PM
2 Kudos
@learninghuman To state it most simply, auto-scaling is a capability of Cloudbreak only at this point in time. With Cloudbreak Periscope, you can define a scaling policy and apply it to any Alert on any Ambari Metric. Scaling granularity is at the Ambari host group level, which gives you the option to scale selected services or components only, not the whole cluster. Per your line of questioning above, if you use Cloudbreak to provision HDP on either Azure IaaS or AWS IaaS, you can use the auto-scaling capabilities it provides. Both Azure HDInsight (HDI) and Hortonworks Data Cloud for AWS (HDC) make it very easy to manually re-size your cluster through their respective consoles, but auto-scaling is not a feature of either offering at this point in time. In regards to data re-balancing, neither HDI nor HDC need to be concerned with this, because they are both automatically configured to use cloud storage (currently ADLS and S3, respectively), not HDFS. For HDP deployed on IaaS with Cloudbreak, auto-scaling may perform an HDFS rebalance, but only after a downscale operation. In order to keep a healthy HDFS during downscale, Cloudbreak always keeps the replication factor configured and makes sure that there is enough space on HDFS to rebalance data.
During downscale, in order to minimize the rebalancing, replication, and HDFS storms, Cloudbreak checks block locations and computes the least costly operations.
12-30-2016
02:25 PM
@learninghuman You can read more in "Hadoop Azure Support: Azure Blob Storage" in the Apache doc for Hadoop 2.7.2. You'd need to check with the vendors behind the other distros to see whether they support this.
12-28-2016
03:22 PM
1 Kudo
@learninghuman Yes, this is correct as of today. The next major release of HDP (3) will provide support for ADLS and S3 - so if you get started now with either HDI 3.5 or HDC 2.5, you aren't locking yourself into those PaaS offerings long-term. Cloudbreak / HDP will continue to offer you cloud portability.
12-27-2016
03:56 PM
1 Kudo
@learninghuman If these answers are helpful, please don't forget to Accept the top one for me! Thanks and Happy New Year! _Tom