Member since: 10-08-2015
Posts: 87
Kudos Received: 142
Solutions: 23
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 183 | 03-02-2017 03:39 PM |
| | 1286 | 02-09-2017 06:43 PM |
| | 7109 | 02-04-2017 05:38 PM |
| | 842 | 01-10-2017 10:24 PM |
| | 974 | 01-05-2017 06:58 PM |
03-30-2018
04:19 PM
@Ajay - Thank you for this article! Can you please re-label this to Apache Hadoop HDFS Ozone, rather than Apache Ozone? The latter is not the proper use of Apache branding. Thanks. Tom
03-29-2017
06:01 PM
@Anandakrishnan Ramakrishnan --delete-target-dir is meant to delete the <HDFS-target-dir> provided in the command before writing data to that directory. If it isn't a permissions issue, I suspect it may be failing because this isn't an HDFS directory.
03-26-2017
04:38 PM
2 Kudos
@Francisco Pires Typically, for data warehousing, we recommend logically organizing your data into tiers for processing. The physical organization is a little different for everyone, but here is an example for Hive:
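One common way to realize the tiers described above is a separate Hive database per tier. This is an illustrative sketch only; the connection URL, database names, and tier layout are hypothetical, not taken from the original post:

```shell
# Hypothetical three-tier layout as separate Hive databases, created via beeline.
# The JDBC URL and database names are illustrative.
beeline -u jdbc:hive2://hiveserver:10000 -e "
  CREATE DATABASE IF NOT EXISTS raw;      -- landing zone: data exactly as ingested
  CREATE DATABASE IF NOT EXISTS staging;  -- cleansed and conformed intermediate data
  CREATE DATABASE IF NOT EXISTS curated;  -- modeled tables served to analysts
"
```

Separate databases make it easy to apply different retention, compaction, and authorization policies per tier.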
03-02-2017
03:39 PM
1 Kudo
@eorgad To protect the S3A access/secret keys, we recommend either IAM role-based authentication (such as an EC2 instance profile) or the Hadoop Credential Provider Framework, which stores the keys securely and makes them available through configuration. The Hadoop Credential Provider Framework allows secure "Credential Providers" to keep secrets outside Hadoop configuration files, storing them in encrypted files on local or Hadoop filesystems and including them in requests. The Hadoop-AWS Module documentation describes how to configure this properly.
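As a sketch of the Credential Provider approach, the keys can be stored in a JCEKS keystore with the `hadoop credential` command (the keystore path and NameNode address below are illustrative):

```shell
# Store the S3A keys in an encrypted keystore on HDFS; each command
# prompts for the secret value. Path and host are illustrative.
hadoop credential create fs.s3a.access.key \
  -provider jceks://hdfs@nn:8020/user/admin/s3.jceks
hadoop credential create fs.s3a.secret.key \
  -provider jceks://hdfs@nn:8020/user/admin/s3.jceks

# Point a job at the provider instead of putting keys in configuration files:
hadoop fs \
  -D hadoop.security.credential.provider.path=jceks://hdfs@nn:8020/user/admin/s3.jceks \
  -ls s3a://my-bucket/
```

The provider path can also be set once in core-site.xml via hadoop.security.credential.provider.path so individual jobs don't need the -D override.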
02-09-2017
08:23 PM
1 Kudo
@Peter Coates As always, it is good to hear from you. In lieu of answering all of your questions outright (since several of them deal with Amazon Proprietary IP) ... If I helped progress your research here, I'd be very appreciative if you could Accept my answer. Thanks. Tom
02-09-2017
06:43 PM
1 Kudo
@Peter Coates I think the AWS doc is pretty clear on this: "EBS-optimized instances deliver dedicated bandwidth to Amazon EBS". How they do that is unclear, but it is safe to say that creating an overlay network (i.e. VPC, EBS-Optimized, etc.) would require the use of SDN technology. That being said, there is a pretty descriptive chart on expected bandwidth, throughput, and IOPS for Amazon EBS-Optimized Instances here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html. Regarding your observations on S3 performance, you are paying a lot less for S3, so most of this is to be expected. However, the link I shared with you previously covers the work we are driving in the community around Hadoop-AWS integration. This work deals with how HDFS needs to be optimized to interact with blob storage APIs (such as S3A) going forward, as opposed to how it historically interacted with direct-attached storage.
02-09-2017
12:38 AM
1 Kudo
@Peter Coates Because of the nature of the S3 object store, data written to an S3A OutputStream is not written incrementally; instead, by default, it is buffered to disk until the stream is closed by its close() method. This can make output slow: the further the process is from the S3 endpoint, or the smaller the EC2-hosted VM is, the longer the work will take to complete. Work to address this began in Hadoop 2.7 with S3AFastOutputStream (HADOOP-11183), and has continued with S3ABlockOutputStream (HADOOP-13560). With incremental writes of blocks, "S3A fast upload" offers an upload time at least as fast as the "classic" mechanism, with significant benefits on long-lived output streams and when very large amounts of data are generated. Please see "Hadoop-AWS module: Integration with Amazon Web Services" for more information on Hadoop-AWS integration. On instances without support for EBS-optimized throughput, network traffic can contend with traffic between your instance and your EBS volumes. EBS-optimized instances deliver dedicated bandwidth to Amazon EBS, with options between 500 Mbps and 10,000 Mbps, depending on the instance type you use. When you enable EBS optimization for an instance that is not EBS-optimized by default, you pay an additional low, hourly fee for the dedicated capacity.
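To illustrate, the fast-upload path can be enabled per job via the S3A properties documented in the Hadoop-AWS module (the values and paths below are illustrative, and the exact property set depends on your Hadoop version):

```shell
# Sketch: copy to S3 with S3A fast upload enabled, buffering blocks to disk
# and uploading them incrementally as multipart parts. Paths are illustrative.
hadoop distcp \
  -D fs.s3a.fast.upload=true \
  -D fs.s3a.fast.upload.buffer=disk \
  -D fs.s3a.multipart.size=64M \
  /data/logs s3a://my-bucket/logs
```

Setting these in core-site.xml instead applies them to all S3A writers on the cluster.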
02-08-2017
02:00 PM
1 Kudo
@BigDataRocks Please let me know if this helped answer your question. Thanks. Tom
02-04-2017
05:38 PM
2 Kudos
@BigDataRocks I believe you need to escape the wildcard: val df = spark.sparkContext.textFile("s3n://..../\*.gz"). Additionally, the S3N filesystem client, while widely used, is no longer undergoing active maintenance except for emergency security issues. The S3A filesystem client can read all files created by S3N, so it should be used wherever possible. Please see https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md for the S3A classpath dependencies and authentication properties you need to be aware of. A nice tutorial on this subject can be found here: https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
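As a sketch of the S3A route, the hadoop-aws module can be pulled onto the Spark classpath at launch (the artifact version and bucket path below are illustrative; match the version to your Hadoop distribution):

```shell
# Launch spark-shell with the hadoop-aws module (and its transitive AWS SDK
# dependency) on the classpath. Version and path are illustrative.
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
# Then, inside the shell, read via s3a:// instead of s3n://:
#   val df = spark.sparkContext.textFile("s3a://my-bucket/path/*.gz")
```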
01-30-2017
09:22 PM
1 Kudo
@shashi cheppela Let me know how things are going. If I've helped answer your question, I'd very much appreciate if you would recognize my effort by Accepting the answer. Thanks. Tom
01-29-2017
01:34 PM
1 Kudo
@shashi cheppela I see you are using --delete-target-dir, which attempts to delete the import target directory if it exists. Permissions on your S3 bucket may be tripping you up, so I'd start by removing that option. You may also want to double-check the S3A classpath dependencies and authentication properties you are using. See: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md and also: http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html
01-28-2017
02:48 PM
1 Kudo
@shashi cheppela Are you using --hive-import? If so, you may be running into this: SQOOP-3403. The workaround is to avoid using --hive-import with s3:// as the --target-dir.
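One way to apply the workaround is to keep the Hive import on HDFS and copy the result to S3 afterwards. This is an illustrative sketch only; the connection string, table, database, and paths are hypothetical:

```shell
# Sketch: --hive-import with an HDFS --target-dir (avoiding the s3:// case),
# then a separate copy of the warehouse directory to S3. All names illustrative.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --hive-import --hive-table sales.orders \
  --target-dir /tmp/sqoop/orders

# Copy the resulting Hive table data to S3 in a second step:
hadoop distcp /apps/hive/warehouse/sales.db/orders s3a://my-bucket/orders
```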
01-14-2017
02:34 AM
1 Kudo
@Yan Liu We are working on it! See recent news about JanusGraph. To learn more and get involved, visit: https://github.com/JanusGraph/janusgraph. _Tom
01-14-2017
12:40 AM
1 Kudo
@Aditya Hegde There is a good HCC article about this by @Vasilis Vagias, titled HDF on HDI - NiFi.
01-13-2017
02:47 PM
1 Kudo
@Ali The full list of ports is actually spread across multiple docs: Ambari Ports, HDP Ports, and HDF Ports. The only required NiFi ports are the UI ports (HTTP and/or HTTPS) and any ports needed by processors for data. The two config properties are nifi.web.http.port and nifi.web.https.port. For this reason, there isn't one place to go in our HDF docs; however, the link I provided above is another HCC thread that walks you through it pretty well. Ports for all other HDF components (Storm, Kafka, Ambari) are covered in the other two links I provided above.
Ephemeral ports are assigned by the OS and allocated automatically from a predefined range. For example, Linux uses the port range 32768 to 61000. HDFS does default to ephemeral ports for some HTTP/RPC endpoints, which can cause bind exceptions on service startup if the port is in use. For this reason, HDFS-9427 was created to update the HDFS default HTTP/RPC ports to non-ephemeral ports; this is resolved in Hadoop 3.0. Although other ephemeral ports are used by the services you mention, those ports are not exposed through config.
When configuring firewalls, you cannot account for every random port that may be assigned by the OS. That is why firewall rules are often directional. For example, you wouldn't make a firewall rule that said "Allow traffic from local port 53446 (randomly assigned) to remote port 50070". Your firewall rule would be more like "allow a TCP connection originating locally destined for port 50070 on host XXXX".
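The ephemeral range mentioned above can be checked directly on a Linux host, since the kernel exposes it through procfs:

```shell
# Print the ephemeral (dynamic) port range Linux assigns client-side ports from.
# Prints two numbers: the low and high end of the range (e.g. 32768 and 61000).
cat /proc/sys/net/ipv4/ip_local_port_range
```

Administrators sometimes narrow this range (via sysctl net.ipv4.ip_local_port_range) to keep ephemeral assignments away from ports that services bind to.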
01-11-2017
06:43 PM
1 Kudo
@jhals99 Likewise for the great question! If you could accept my answer, I'd be very appreciative. Thanks. Tom
01-10-2017
10:24 PM
2 Kudos
@jhals99 I'm looking to provision HDP clusters in an AWS VPC, on custom hardened AMIs. I understand from the link below that Cloudbreak with a custom AMI requires a support subscription. Which level of support exactly?
Cloudbreak in HDP is supported within the Enterprise Plus Subscription. Please see Support Subscriptions on hortonworks.com for more information.
I have data on S3 encrypted with server-side encryption. S3A should provide seamless access to encrypted data: yes/no?
Yes. For Spark, the following configuration is available (in spark-defaults.conf or via --conf):
spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY
I heard about an Amazon EMR customer using S3 (not local storage/HDFS) for an HBase environment. I assume S3A will provide the same functionality with HBase in HDP? No special sauce in EMR, right?
S3 is not recommended for HBase today. Amazon EMR supports HBase with a combination of HFiles on S3 and the WAL on ephemeral HDFS; this configuration can lose data in the face of failures. Our HBase team is aware of this, and there are a couple of other options we are exploring, such as HFiles on S3 and the WAL on EBS/EFS; this is still under investigation.
EC2 local disks have to be LUKS encrypted. That should not matter to Cloudbreak/HDP, right?
Correct. Hadoop uses the operating system to read disk. Disk-level encryption sits below the operating system and is therefore not a concern for Cloudbreak/HDP.
We require S3A role-based authentication, with access to S3 files from Hive, Pig, MR, Spark, etc. My question is: does this work out of the box with the latest release of HDP?
Yes. Please refer to the HCC article How to Access Data Files stored in AWS S3 Buckets using HDFS / Hive / Pig for more information.
01-07-2017
05:20 PM
1 Kudo
@learninghuman I hope this answered your questions. If so, please remember to accept my answer. Thank you! _Tom
01-06-2017
10:49 PM
1 Kudo
@learninghuman
To help clarify, all of the data access components within HDP run on YARN. We view Mesos as one of the many alternatives for IaaS within the private cloud space (OpenStack, VMware, etc.). Our aim is to support them all and provide our customers both connectivity and portability across them with HDF and HDP.
01-05-2017
06:58 PM
2 Kudos
@learninghuman Cloudbreak currently offers a Technical Preview for deployment on Mesos. At a high level, Cloudbreak deployment on Mesos is similar to Cloudbreak implementations on other cloud providers: HDP clusters are provisioned through Ambari with the help of blueprints, and Ambari server and agents run in Docker containers. However, there are some important differences with other cloud providers:
Cloudbreak expects a "bring your own Mesos" infrastructure, which means that you have to deploy Mesos first and then configure access to the existing Mesos deployment in Cloudbreak. The Cloudbreak Mesos integration was designed not to include the infrastructure-building steps it performs for other cloud environments, such as creating or reusing the networking layer (virtual networks, subnets, and so on), provisioning new virtual machines in these networks from pre-existing cloud images, and starting Docker containers on these VMs (nodes).
From a YARN perspective, there is no difference between Mesos and other public/private clouds: YARN manages the compute capacity provided to it by the IaaS layer.
01-05-2017
02:52 PM
1 Kudo
@Mahen Jay Perhaps the best way to test real-time issues from a student labs perspective is to dig into Ambari and HDP Upgrade. This is covered (with labs) in our HDP Operations: Hadoop Administration 2 course. See the HWU Training Catalog for more details.
01-03-2017
05:49 PM
@learninghuman If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-03-2017
05:48 PM
@Vivek Sharma If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-03-2017
05:47 PM
@Vivek Sharma If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-03-2017
05:46 PM
@Vivek Sharma If this answer helps, please accept it. Otherwise, I'd be happy to answer any remaining questions you have.
Thanks! _Tom
01-02-2017
09:46 PM
2 Kudos
@Vivek Sharma With Cloudbreak Periscope, you can define a scaling policy and apply it to any Alert on any Ambari Metric. Scaling granularity is at the Ambari host group level. This feature, which we refer to as auto-scaling, is only a capability of Cloudbreak at this point in time. Per your line of questioning above, if you use Cloudbreak to provision HDP on either Azure IaaS or AWS IaaS, you can use the auto-scaling capabilities it provides. Both Azure HDInsight (HDI) and Hortonworks Data Cloud for AWS (HDC) make it very easy to manually re-size your cluster through their respective consoles. However, the auto-scaling feature described above is not available with either HDI or HDC at this point in time.
01-02-2017
09:24 PM
@learninghuman HDInsight does not require a secure Virtual Network. However, since it is a Managed Service, if you need to install HDInsight into a secured Virtual Network, you must allow inbound access over port 443 for the following IP addresses, which allow Azure to manage the HDInsight cluster:
- 168.61.49.99
- 23.99.5.239
- 168.61.48.131
- 138.91.141.162
01-02-2017
08:48 PM
1 Kudo
@Vivek Sharma Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a virtual private cloud (VPC), an isolated area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables, and network gateways. HDCloud requires a VPC, and is therefore limited to the AWS private cloud. From the Network and Security section of the current Hortonworks Data Cloud documentation:
In addition to the Amazon EC2 instances created for the cloud controller and cluster nodes, Hortonworks Data Cloud deploys the following network and security AWS resources on your behalf:
- An Amazon VPC configured with a public subnet: When deploying the cloud controller, you have two options: (1) you can specify an existing VPC, or (2) have the cloud controller create a new VPC. Each cluster is launched into a separate subnet. For more information, see the Security documentation.
- An Internet gateway and a route table (as part of the VPC infrastructure): An Internet gateway is used to enable outbound access to the Internet from the control plane and the clusters, and a route table is used to connect the subnet to the Internet gateway. For more information on Amazon VPC architecture, see the AWS documentation.
- Security groups: to control the inbound and outbound traffic to and from the control plane instance. For more information, see the Security documentation.
- IAM instance roles: to hold the permissions to create certain resources. For more information, see the Security documentation.
If using your own VPC, make sure that:
- The subnet specified when creating a controller or cluster exists within the specified VPC.
- Your VPC has an Internet gateway attached.
- Your VPC has a route table attached.
- The route table includes a rule that routes all traffic (0.0.0.0/0) to the Internet gateway. This routes all subnet traffic that isn't between the instances within the VPC to the Internet over the Internet gateway.
Since the subnets used by HDC must be associated with a route table that has a route to an Internet gateway, they are referred to as public subnets. Because of this, the system is configured by default to restrict inbound network traffic to a minimal set of ports. The following security groups are created automatically:
- The CloudbreakSecurityGroup security group is created when launching your cloud controller and is associated with your cloud controller instance. By default, this group enables HTTP (80) and HTTPS (443) access to the Cloud UI and SSH access from the remote locations specified in the "Remote Access" CloudFormation parameter.
- The ClusterNodeSecurityGroupmaster security group is created when you create a cluster and is associated with all master node(s). By default, this group enables SSH access from the remote locations specified in the "Remote Access" parameter when creating the cluster.
- The ClusterNodeSecurityGroupworker security group is created when you create a cluster and is associated with all worker node(s). By default, this group enables SSH access from the remote locations specified in the "Remote Access" parameter when creating the cluster.
See the Ports section of the Security documentation for information about additional ports that may be opened on these groups.