Member since: 04-26-2016
Posts: 78
Kudos Received: 32
Solutions: 0
10-25-2016
11:28 AM
Hi, In the HDFS Admin Guide, for copying data across encryption zones (inter-cluster or intra-cluster), it is recommended to run distcp against /.reserved/raw/source_data_dir rather than source_data_dir directly. I believe the reason is to avoid unnecessary decryption on the source and re-encryption on the destination for the copied data. My question: if we copy from the /.reserved/raw directory, the data on the destination will obviously still be in encrypted form, which means the KMS keys also need to be copied over separately, e.g. as a database dump or something similar? Any pointers on what the best strategy is in this case?
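For reference, a minimal sketch of such a raw copy (hostnames and the destination directory are placeholders, not from the guide). The -px flag preserves the extended attributes that carry the per-file encrypted keys, and the command must run as the HDFS superuser, since /.reserved/raw is superuser-only. Note that distcp itself moves no key material: the destination KMS must already hold the encryption-zone key (same name and material), so the keys do have to be provisioned on the destination side separately.

hadoop distcp -px \
  hdfs://source-nn:8020/.reserved/raw/source_data_dir \
  hdfs://dest-nn:8020/.reserved/raw/dest_data_dir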
06-22-2016
09:44 AM
Hi, I understand some of the services can be set up in HA mode as documented in the docs. However, I am trying to understand what "High Availability" means for each of the following HDP services/components (a couple of concrete sketches follow below the list):
- Tez
- Spark (presume it's client-only, so HA doesn't apply since multiple clients can be installed)
- Slider
- Phoenix (presume it's client-only, so HA doesn't apply since multiple clients can be installed)
- Accumulo
- Storm (is it all about setting up Nimbus HA?)
- Falcon
- Atlas
- Sqoop (presume it's client-only, so HA doesn't apply since multiple clients can be installed; but I am wondering about the role of the database behind Sqoop)
- Flume
- Ambari (presume no native HA is available at the moment, but it is planned for the future)
- ZooKeeper (presume ZooKeeper itself is inherently HA due to its ensemble, and that is what provides HA to many other components, but I wanted to understand whether there is more to this)
- Knox
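To make two of these concrete, minimal sketches (hostnames are placeholders):

# Storm (1.0+): Nimbus HA is configured by listing several seed hosts
# in storm.yaml, e.g. nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]

# ZooKeeper: HA comes from the ensemble itself; each member reports its
# role via the four-letter-word commands:
echo stat | nc zk1.example.com 2181 | grep Mode   # Mode: leader (or follower)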
Labels:
- Hortonworks Data Platform (HDP)
06-17-2016
07:46 PM
Thanks @Arpit Agarwal for your response. So it finally boils down to choosing between the RR and AvailableSpace policies, with Hortonworks recommending the RR policy plus DiskBalancer versus Cloudera's recommendation of the AvailableSpace policy? Am I correct in saying that? 🙂
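For anyone following along, a rough sketch of the DiskBalancer workflow referred to here (the hostname and plan path are placeholders; it requires dfs.disk.balancer.enabled=true and a release that ships DiskBalancer):

hdfs diskbalancer -plan dn1.example.com       # writes a <hostname>.plan.json
hdfs diskbalancer -execute /system/diskbalancer/<date>/dn1.example.com.plan.json
hdfs diskbalancer -query dn1.example.com      # check progress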
06-17-2016
02:47 PM
1 Kudo
Hi @Arpit Agarwal, I don't know the intricacies of this, but I am trying to understand which is the better option: running the balancer as a recovery mechanism at regular intervals, or using a better placement policy when the blocks are written in the first place. I presume the default placement policy is RR, and with round-robin placement the smaller disks fill up faster. If instead the placement policy took both the available space and the I/O throughput of each disk into account, wouldn't that be a better choice? Also, as documented, these two properties are only applicable when dfs.datanode.fsdataset.volume.choosing.policy is set to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml). But I couldn't find any property named dfs.datanode.fsdataset.volume.choosing.policy, so please let me know where this is set. Please correct me if my understanding is wrong.
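A minimal sketch of how this could be checked and set (the two numeric values are the documented defaults; treat everything here as an example):

# Read the effective policy; when unset, the DataNode falls back to
# round-robin, which is why the property may not appear in hdfs-site.xml:
hdfs getconf -confKey dfs.datanode.fsdataset.volume.choosing.policy

# To opt in, hdfs-site.xml on the DataNodes would set:
#   dfs.datanode.fsdataset.volume.choosing.policy =
#     org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy
#   dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold = 10737418240
#   dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction = 0.75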
06-01-2016
12:46 PM
Thanks @Benjamin Leonhardi. I am looking for a solution where a source cluster feeds into two downstream clusters. In another question on HCC there was a mention of HDF being a good fit, hence I wanted to understand its merits in comparison with Falcon.
06-01-2016
10:50 AM
In a teeing-based solution where the data is ingested simultaneously into two clusters, can Falcon be used, similar to a Flume multi-sink setup? Alternatively, is this better done with HDF than with Falcon? What are the benefits of each?
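For comparison, the Flume-style teeing alluded to above looks roughly like this: a replicating channel selector fans one source out to two channels, each drained by an HDFS sink pointing at a different cluster (agent, host, and path names are invented for illustration):

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
# One source, duplicated onto both channels:
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/incoming
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c2.type = memory
# Each sink writes the same events to a different cluster:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://clusterA-nn:8020/data/ingest
a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c2
a1.sinks.k2.hdfs.path = hdfs://clusterB-nn:8020/data/ingest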
Labels:
- Apache Falcon
- Cloudera DataFlow (CDF)
05-23-2016
10:57 AM
@Benjamin Leonhardi Let me rephrase my question. Assume I have an HDP cluster and an edge node outside the cluster, with the Knox service installed on the edge node. My question is which is the better way of ingesting data into HDFS:
1. Use the edge node as a staging area: land the data on the edge node first (which means storage is needed on the edge node) and then ingest it into HDFS. This helps avoid exposing the DataNodes to the outside world.
2. Configure the Knox service on the edge node so that WebHDFS API calls go through Knox, keeping the NameNode URL/IP address hidden behind the gateway. In this case the source streams directly to HDFS, with Knox doing the address translation, and no additional storage is needed on the edge node for staging the data temporarily.
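As a concrete sketch of option 2 (gateway host, topology name, and credentials are placeholders), every call targets only Knox, which rewrites the internal cluster addresses in its responses:

# List a directory through the gateway:
curl -ku user:password "https://knox-host:8443/gateway/default/webhdfs/v1/data?op=LISTSTATUS"

# Two-step file create: the 307 redirect returned by the first call points
# back at Knox, so the DataNode address is never exposed to the source:
curl -iku user:password -X PUT "https://knox-host:8443/gateway/default/webhdfs/v1/data/landing/file.csv?op=CREATE"
curl -ku user:password -X PUT -T file.csv "<Location header from the previous response>"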
05-23-2016
10:39 AM
@Pradeep Bhadani I have seen solutions where the staging option was taken, so I am just wondering what advantages the staging approach brings in comparison with streaming directly through the Knox WebHDFS API.
05-23-2016
10:20 AM
Hi, Is it good practice to stream data directly from the source systems into HDFS using the Knox-exposed WebHDFS APIs, or is using the Knox edge node as a staging area before ingesting into HDFS the better approach? Thanks
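For context, the staging variant amounts to a two-hop copy, roughly like this (hosts, user, and paths are invented for illustration):

scp ./data.csv etluser@edge-node:/staging/
ssh etluser@edge-node 'hdfs dfs -put /staging/data.csv /data/landing/ && rm /staging/data.csv'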
Labels:
- Apache Hadoop
- Apache Knox