About tmccuch

tmccuch · 03-30-2018

@Ajay - Thank you for this article! Can you please re-label this to Apache Hadoop HDFS Ozone, rather than Apache Ozone? The latter is not the proper use of Apache branding. Thanks. Tom

tmccuch · 03-26-2017

@Francisco Pires Typically, for data warehousing, we recommend logically organizing your data into tiers for processing. The physical organization is a little different for everyone, but here is an example for Hive:

tmccuch · 03-02-2017

@eorgad To protect the S3A access and secret keys, we recommend one of two approaches: IAM role-based authentication (such as an EC2 instance profile), or the Hadoop Credential Provider Framework, which stores the keys securely and exposes them through configuration. The framework lets secure "Credential Providers" keep secrets outside Hadoop configuration files, storing them in encrypted files on local or Hadoop filesystems and including them in requests. The Hadoop-AWS module documentation describes how to configure this properly.
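
As a rough illustration of the Credential Provider approach, the sketch below renders the core-site.xml entry that points Hadoop at a keystore. The property name hadoop.security.credential.provider.path comes from the Hadoop documentation; the keystore URI and the helper name hadoop_config_xml are hypothetical, and the sketch assumes the keystore was already populated (for example with the hadoop credential create command).

```python
import xml.etree.ElementTree as ET


def hadoop_config_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical keystore location; substitute your own host and path.
KEYSTORE = "jceks://hdfs@namenode.example.com/user/etl/s3a.jceks"

# Only the keystore path goes into the config file; the S3A access and
# secret keys themselves stay inside the encrypted keystore.
core_site = hadoop_config_xml({
    "hadoop.security.credential.provider.path": KEYSTORE,
})
print(core_site)
```

The point of the design is that the configuration file carries a pointer to the secrets rather than the secrets themselves.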

tmccuch · 02-09-2017

@Peter Coates As always, it is good to hear from you. In lieu of answering all of your questions outright (since several of them deal with Amazon Proprietary IP) ... If I helped progress your research here, I'd be very appreciative if you could Accept my answer. Thanks. Tom

tmccuch · 02-09-2017

@Peter Coates I think the AWS doc is pretty clear on this: "EBS-optimized instances deliver dedicated bandwidth to Amazon EBS". How they do that is unclear, but it is safe to say that creating an overlay network (VPC, EBS-Optimized, etc.) would require the use of SDN technology. That said, there is a pretty descriptive chart of expected bandwidth, throughput, and IOPS for Amazon EBS-optimized instances here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html. Regarding your observations on S3 performance: you are paying a lot less for S3, so most of this is to be expected. However, the link I shared with you previously covers the work we are driving in the community around Hadoop-AWS integration. This work deals with how HDFS needs to be optimized to interact with blob storage APIs (such as S3A) going forward, as opposed to how it historically interacted with direct-attached storage.

tmccuch · 02-09-2017

@Peter Coates Because of the nature of the S3 object store, data written to an S3A OutputStream is not written incrementally. Instead, by default, it is buffered to disk until the stream is closed in its close() method. This can make output slow: the further the process is from the S3 endpoint, or the smaller the EC2-hosted VM is, the longer the work takes to complete. Work to address this began in Hadoop 2.7 with the S3AFastOutputStream (HADOOP-11183), and has continued with S3ABlockOutputStream (HADOOP-13560). With incremental writes of blocks, "S3A fast upload" offers an upload time at least as fast as the "classic" mechanism, with significant benefits on long-lived output streams and when very large amounts of data are generated. Please see "Hadoop-AWS module: Integration with Amazon Web Services" for more information on Hadoop-AWS integration.

On instances without support for EBS-optimized throughput, network traffic can contend with traffic between your instance and your EBS volumes. EBS-optimized instances deliver dedicated bandwidth to Amazon EBS, with options between 500 Mbps and 10,000 Mbps, depending on the instance type you use. When you enable EBS optimization for an instance that is not EBS-optimized by default, you pay an additional low, hourly fee for the dedicated capacity.
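
For the S3A fast upload discussed earlier in this post, here is a minimal sketch of how the option might be passed to a Spark job. It assumes the fs.s3a.fast.upload and fs.s3a.fast.upload.buffer properties from the Hadoop 2.8-era Hadoop-AWS documentation (check the documentation for your Hadoop version, since defaults changed in later releases) and relies on Spark's spark.hadoop. prefix for forwarding Hadoop properties. The helper name spark_conf_args is hypothetical.

```python
def spark_conf_args(hadoop_props):
    """Turn Hadoop properties into spark-submit --conf arguments.

    Spark copies any property prefixed with "spark.hadoop." into the
    Hadoop configuration used by the job.
    """
    args = []
    for name, value in sorted(hadoop_props.items()):
        args += ["--conf", f"spark.hadoop.{name}={value}"]
    return args


# Assumed property names; verify them against your Hadoop release.
fast_upload = {
    "fs.s3a.fast.upload": "true",          # upload blocks incrementally
    "fs.s3a.fast.upload.buffer": "disk",   # buffer pending blocks on local disk
}

print(" ".join(spark_conf_args(fast_upload)))
```

The printed arguments can be appended to a spark-submit command line; the same two properties could equally be set in core-site.xml.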

tmccuch · 02-08-2017

@BigDataRocks Please let me know if this helped answer your question. Thanks. Tom

tmccuch · 02-04-2017

@BigDataRocks I believe you need to escape the wildcard: val df = spark.sparkContext.textFile("s3n://..../\*.gz"). Additionally, the S3N filesystem client, while widely used, is no longer under active maintenance except for emergency security issues. The S3A filesystem client can read all files created by S3N, so it should be used wherever possible. Please see https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md for the S3A classpath dependencies and authentication properties you need to be aware of. A nice tutorial on this subject can be found here: https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
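
Since S3A can read everything S3N wrote, moving a job over is mostly a matter of changing the URI scheme (plus the classpath and authentication settings covered in the links above). A small sketch of that rewrite follows; the helper name to_s3a and the bucket name are made up for illustration.

```python
def to_s3a(uri):
    """Rewrite an s3n:// URI to the s3a:// scheme; leave other URIs untouched."""
    old_prefix = "s3n://"
    if uri.startswith(old_prefix):
        return "s3a://" + uri[len(old_prefix):]
    return uri


# Hypothetical bucket and path; the glob pattern is preserved as-is.
print(to_s3a("s3n://my-bucket/logs/*.gz"))
```

The rewritten URI can then be passed to the same read call that previously took the s3n:// path.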

tmccuch · 01-11-2017

@jhals99 Likewise for the great question! If you could accept my answer, I'd be very appreciative. Thanks. Tom

tmccuch · 01-10-2017

@jhals99

"I'm looking to provision HDP clusters in an AWS VPC, on custom hardened AMIs. Understand from the link below that Cloudbreak w/custom AMI requires support subscription. Which level of support exactly?"
Cloudbreak in HDP is supported within the Enterprise Plus Subscription. Please see Support Subscriptions on hortonworks.com for more information.

"I have data on S3 encrypted with server-side encryption. S3A should provide seamless access to encrypted data: yes/no?"
Yes. In core-site.xml, you'll find the following configuration available:
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY

"I heard about an Amazon EMR customer using S3 (not local storage/HDFS) for an HBase environment. I assume S3A will provide the same functionality with HBase in HDP? No special sauce in EMR, right?"
S3 is not recommended for HBase today. Amazon EMR supports HBase with a combination of HFiles on S3 and the WAL on ephemeral HDFS; this configuration can lose data in the face of failures. Our HBase team is aware of this, and there are a couple of other options we are exploring, such as HFiles on S3 and the WAL on EBS / EFS. This is still under investigation.

"EC2 local disks have to be LUKS encrypted. That should not matter to Cloudbreak/HDP, right?"
Correct. Hadoop uses the operating system to read disk. Disk-level encryption sits below the operating system and is therefore not a concern for Cloudbreak/HDP.

"We require S3A role-based authentication. Access to S3 files from Hive, Pig, MR, Spark etc. My question is: does this work out of the box with the latest release of HDP?"
Yes. Please refer to the HCC article "How to Access Data Files stored in AWS S3 Buckets using HDFS / Hive / Pig" for more information.

Member Since	10-08-2015 02:48 PM
Last Visited	01-15-2021 12:43 PM
Posts	87
Kudos received	136

Cloudera Community

Re: What is the best way to secure S3A objects on ...

Re: What are the cluster-wide bandwidth limitation...

Re: spark 2.1.0 Reading *.gz files from an s3 buck...

Re: HDP install in AWS VPC, on custom AMI -- feasi...

Re: HDP on Mesos using Marathon, Docker

Re: What is HDFS Ozone?

Re: Data Lake Architecture