Member since: 04-26-2016
Posts: 78
Kudos Received: 32
Solutions: 0
10-25-2016
11:28 AM
Hi, In the HDFS Admin Guide, for copying data across encryption zones (inter-cluster or intra-cluster), it is recommended to run distcp against /.reserved/raw/source_data_dir rather than source_data_dir directly. I believe the reason is to avoid unnecessary decryption on the source and re-encryption on the destination for the copied data. My question: if we copy from the /.reserved/raw directory, the data on the destination will obviously still be in encrypted form, which means the KMS keys also need to be copied over separately, e.g. as a database dump or something similar? Any pointers on what the best strategy is in this case?
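For reference, a minimal sketch of such a raw copy (hostnames and the destination directory are placeholders, not from the guide). The -px flag preserves the extended attributes that carry the per-file encrypted keys, and the command must run as the HDFS superuser, since /.reserved/raw is superuser-only. Note that distcp itself moves no key material: the destination KMS must already hold the encryption-zone key (same name and material), so the keys do have to be provisioned on the destination side separately.

hadoop distcp -px \
  hdfs://source-nn:8020/.reserved/raw/source_data_dir \
  hdfs://dest-nn:8020/.reserved/raw/dest_data_dir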
06-22-2016
09:44 AM
Hi, I understand some of the services can be set up in HA mode as documented in the docs. However, I am trying to understand what "High Availability" means for each of the following HDP services/components (a couple of concrete sketches follow below the list):
- Tez
- Spark (presume it's client-only, so HA doesn't apply since multiple clients can be installed)
- Slider
- Phoenix (presume it's client-only, so HA doesn't apply since multiple clients can be installed)
- Accumulo
- Storm (is it all about setting up Nimbus HA?)
- Falcon
- Atlas
- Sqoop (presume it's client-only, so HA doesn't apply since multiple clients can be installed; but I am wondering about the role of the database behind Sqoop)
- Flume
- Ambari (presume no native HA is available at the moment, but it is planned for the future)
- ZooKeeper (presume ZooKeeper itself is inherently HA due to its ensemble, and that is what provides HA to many other components, but I wanted to understand whether there is more to this)
- Knox
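To make two of these concrete, minimal sketches (hostnames are placeholders):

# Storm (1.0+): Nimbus HA is configured by listing several seed hosts
# in storm.yaml, e.g. nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]

# ZooKeeper: HA comes from the ensemble itself; each member reports its
# role via the four-letter-word commands:
echo stat | nc zk1.example.com 2181 | grep Mode   # Mode: leader (or follower)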
Labels:
- Hortonworks Data Platform (HDP)
06-17-2016
07:46 PM
Thanks @Arpit Agarwal for your response. So it finally boils down to choosing between the RR and AvailableSpace policies, with Hortonworks recommending the RR policy plus DiskBalancer versus Cloudera's recommendation of the AvailableSpace policy? Am I correct in saying that? 🙂
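For anyone following along, a rough sketch of the DiskBalancer workflow referred to here (the hostname and plan path are placeholders; it requires dfs.disk.balancer.enabled=true and a release that ships DiskBalancer):

hdfs diskbalancer -plan dn1.example.com       # writes a <hostname>.plan.json
hdfs diskbalancer -execute /system/diskbalancer/<date>/dn1.example.com.plan.json
hdfs diskbalancer -query dn1.example.com      # check progress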
06-17-2016
02:47 PM
1 Kudo
Hi @Arpit Agarwal, I don't know the intricacies of this, but I am trying to understand which is the better option: running the balancer as a recovery mechanism at regular intervals, or using a better placement policy when the blocks are written in the first place. I presume the default placement policy is RR, and with round-robin placement the smaller disks fill up faster. If instead the placement policy took both the available space and the I/O throughput of each disk into account, wouldn't that be a better choice? Also, as documented, these two properties are only applicable when dfs.datanode.fsdataset.volume.choosing.policy is set to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml). But I couldn't find any property named dfs.datanode.fsdataset.volume.choosing.policy, so please let me know where this is set. Please correct me if my understanding is wrong.
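A minimal sketch of how this could be checked and set (the two numeric values are the documented defaults; treat everything here as an example):

# Read the effective policy; when unset, the DataNode falls back to
# round-robin, which is why the property may not appear in hdfs-site.xml:
hdfs getconf -confKey dfs.datanode.fsdataset.volume.choosing.policy

# To opt in, hdfs-site.xml on the DataNodes would set:
#   dfs.datanode.fsdataset.volume.choosing.policy =
#     org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy
#   dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold = 10737418240
#   dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction = 0.75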
06-01-2016
12:46 PM
Thanks @Benjamin Leonhardi. I am looking for a solution where a source cluster feeds into two downstream clusters. In another question on HCC there was a mention of HDF being a good fit, hence I wanted to understand its merits in comparison with Falcon.
06-01-2016
10:50 AM
In a teeing-based solution where the data is ingested simultaneously into two clusters, can Falcon be used, similar to a Flume multi-sink setup? Alternatively, is this better done with HDF than with Falcon? What are the benefits of each?
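For comparison, the Flume-style teeing alluded to above looks roughly like this: a replicating channel selector fans one source out to two channels, each drained by an HDFS sink pointing at a different cluster (agent, host, and path names are invented for illustration):

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# One source, duplicated onto both channels:
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/incoming
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c2.type = memory
# Each sink writes the same events to a different cluster:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://clusterA-nn:8020/data/ingest
a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c2
a1.sinks.k2.hdfs.path = hdfs://clusterB-nn:8020/data/ingest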
Labels:
- Apache Falcon
- Cloudera DataFlow (CDF)
05-23-2016
10:57 AM
@Benjamin Leonhardi Let me rephrase my question. Assume I have an HDP cluster and an edge node outside the cluster, with the Knox service installed on the edge node. My question is which is the better way of ingesting data into HDFS:
1. Use the edge node as a staging area: land the data on the edge node first (which means storage is needed on the edge node) and then ingest it into HDFS. This helps avoid exposing the DataNodes to the outside world.
2. Configure the Knox service on the edge node so that WebHDFS API calls go through Knox, keeping the NameNode URL/IP address hidden behind the gateway. In this case the source streams directly to HDFS, with Knox doing the address translation, and no additional storage is needed on the edge node for staging the data temporarily.
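As a concrete sketch of option 2 (gateway host, topology name, and credentials are placeholders), every call targets only Knox, which rewrites the internal cluster addresses in its responses:

# List a directory through the gateway:
curl -ku user:password "https://knox-host:8443/gateway/default/webhdfs/v1/data?op=LISTSTATUS"

# Two-step file create: the 307 redirect returned by the first call points
# back at Knox, so the DataNode address is never exposed to the source:
curl -iku user:password -X PUT "https://knox-host:8443/gateway/default/webhdfs/v1/data/landing/file.csv?op=CREATE"
curl -ku user:password -X PUT -T file.csv "<Location header from the previous response>"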
05-23-2016
10:39 AM
@Pradeep Bhadani I have seen solutions where the staging option was taken, so I am just wondering what advantages the staging approach brings in comparison with streaming directly through the Knox WebHDFS API.
05-23-2016
10:20 AM
Hi, Is it good practice to stream data directly from the source systems into HDFS using the Knox-exposed WebHDFS APIs, or is using the Knox edge node as a staging area before ingesting into HDFS the better approach? Thanks
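For context, the staging variant amounts to a two-hop copy, roughly like this (hosts, user, and paths are invented for illustration):

scp ./data.csv etluser@edge-node:/staging/
ssh etluser@edge-node 'hdfs dfs -put /staging/data.csv /data/landing/ && rm /staging/data.csv'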
Labels:
- Apache Hadoop
- Apache Knox