Member since: 09-18-2015
Posts: 100
Kudos Received: 98
Solutions: 11

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1468 | 03-22-2016 02:05 AM
 | 1018 | 03-17-2016 06:16 AM
 | 1821 | 03-17-2016 06:13 AM
 | 1317 | 03-12-2016 04:48 AM
 | 4712 | 03-10-2016 08:04 PM
02-03-2016
07:18 PM
So I had some internal discussion, and the real answer is that dynamic scale-down is hard to achieve. You can scale down using Cloudbreak, but Cloudbreak decommissions the service before it kills the Docker image. So you can technically do it, but as you do, HDFS will try to relocate the replicas, which is time consuming. The alternative is to use something like WASB, where the data is not in the local HDFS store but in WASB. Storage and compute are separate, so you can turn instances down easily.
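For illustration only: with the hadoop-azure connector configured, data sits in Azure Blob Storage rather than on the DataNodes, so the usual filesystem commands work against a wasb:// URI. The account and container names below are placeholders, and the storage key is assumed to be already set in core-site.xml.

```
# Data lives in Azure Blob Storage, not on local DataNode disks,
# so removing compute nodes does not trigger block re-replication.
hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/data/
hadoop fs -cat wasb://mycontainer@myaccount.blob.core.windows.net/data/sample.txt
```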
02-03-2016
06:39 PM
1 Kudo
@khushi kalra The short answer is that it depends on what you are looking for. In the Hortonworks platform we have Apache Atlas and Apache Falcon. Although both tools fall under governance, they have different use cases. For metadata management with HDP you should use Apache Atlas. Version 0.5 is the first release of the product, and it gets much slicker with the upcoming release. Waterline integrates with Atlas. Waterline will give you metadata discovery, but it does not completely integrate with HDP. It runs a MapReduce job that lets you see patterns in the data and determine what kind of data it is. Now, if you have to take that file metadata and use it in conjunction with Hive for any policy work, it will be via Atlas. Atlas is part of the DGI framework. The idea of DGI is to provide a metadata exchange where a community of companies can work on one platform. As Neeraj mentioned, Dataguise is one of them. Collibra, Alation and others are also there. Now the question I have for you is: what are you trying to achieve? Governance is a little bit fuzzy in people's minds. Look at the presentation here: http://hortonworks.com/partners/learn/#dgi I hope this helps.
02-03-2016
05:38 PM
How about using DASH? Cloudbreak suggests DASH with WASB.
02-02-2016
03:02 AM
5 Kudos
A customer wants to use Cloudbreak for deploying Hadoop clusters. They want to scale the Hadoop storage nodes up and down.

a) How does HDFS detect a scale-down of nodes, and will it kick in an HDFS rebalance?

- Cloudbreak instructs Ambari, via a decommission REST API call, to decommission the DataNode and NodeManager.
- Ambari triggers the decommission on the HDP cluster. From a bird's-eye view this is what happens, but in an automated way: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ref-a179736c-eb7c-4dda-b3b4-6f3a778bd8c8.1.html
- The decommission of DataNodes can take a long time if you have a lot of blocks: HDFS needs to replicate the blocks belonging to the decommissioning DataNodes to other live DataNodes to reach the replication factor you specified via dfs.replication in hdfs-site.xml. The default value of this replication factor is 3.
- You can get feedback on the decommission process from the NameNode UI (http://ip_of_namenode:50070/dfshealth.html#tab-datanode) or from command-line tools such as "hdfs fsck /" (a couple of example commands are sketched at the end of this post).
- Cloudbreak periodically polls Ambari for the status of the decommissioning, and Ambari monitors the NameNode. Once the decommissioning is finished, Cloudbreak removes the node from Ambari and deletes the decommissioned VMs from the cloud provider.

b) For a scale-up, would we need to manually kick off an HDFS rebalance?

- Cloudbreak does not trigger an HDFS rebalance.

c) How do you know if you have lost a block? For example, if you scale down 8 of your 10 nodes, how would HDFS handle this case, assuming you have enough storage on the 2 remaining nodes?

- HDFS: If you do not have enough live DataNodes to reach the replication factor, the decommission process hangs until more DataNodes become available (e.g., if you have 10 DataNodes in your cluster with dfs.replication set to 3, you can scale the cluster down to 3 nodes).
- Cloudbreak: If you have 10 DataNodes with a replication factor of 3, Cloudbreak won't even let you remove more than 7 instances, and you get back the error message "Cluster downscale failed.: There is not enough node to downscale. Check the replication factor and the ApplicationMaster occupation."
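As a hedged illustration (these are standard HDFS commands, not Cloudbreak's internal calls), you can watch decommission and replication status from any node with the HDFS client configured:

```
# Summary of live, dead and decommissioning DataNodes and their block counts
hdfs dfsadmin -report

# Check the filesystem for missing, corrupt, or under-replicated blocks
hdfs fsck / -blocks -locations
```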
01-20-2016
11:38 PM
@Mehdi TAZI Having small files in HDFS will create issues, with the NameNode filling up quickly and the blocks being too small. There are a number of ways you can combine the files to create right-sized files. You can also check whether HAR is an option (a sketch follows below). HBase can be an option as well; the key design will be critical. You can also look at OpenTSDB if it is time-series kind of data. Yes, you will have to deal with HBase compaction, node rebuilds, etc.
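A minimal sketch of the HAR option mentioned above; the directory layout and archive name are placeholders.

```
# Pack the contents of /data/small-files into a single archive under /data/archives
hadoop archive -archiveName logs.har -p /data small-files /data/archives

# The archived files remain readable through the har:// scheme
hdfs dfs -ls har:///data/archives/logs.har
```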
01-20-2016
11:22 PM
For non-Java access you would need to set up the Thrift server. The Thrift server runs on port 9090. I hope this helps.
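A minimal sketch, assuming the HBase client packages are installed on the node; in a managed cluster you would typically start the Thrift server through Ambari rather than by hand.

```
# Start the HBase Thrift server on a node with the HBase client installed
hbase thrift start &

# Confirm it is listening on the default port 9090
netstat -tlnp | grep 9090
```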
01-20-2016
10:35 PM
5 Kudos
HDP 2.3 Installation on a Single-Node CentOS
In this step we will start creating the HDP compute cluster. We will create a VM with CentOS 6.7 and deploy HDP 2.3 with Ambari 2.1 on a single node. In the second part you will create a Docker instance and make it a DataNode of the existing instance in the same VM. The tutorial shows how easy it is to use Docker to create a multi-node instance. Details:
Create HDP on a Single Node VM
Create a Docker node and add it as a DataNode to the above VM (a rough sketch follows). Have fun!
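Very roughly, and not the exact steps of the tutorial, adding a Docker container as a DataNode looks something like this; the image tag, container name, and host names are placeholders, and the Ambari repo is assumed to be configured inside the container.

```
# Start a CentOS 6.7 container on the same VM that runs Ambari
docker run -d --privileged --name datanode1 --hostname datanode1.example.com centos:6.7 /sbin/init

# Get a shell in the container; from here you would install ambari-agent,
# point it at the Ambari server, start it, then add the host via the
# Ambari UI (Hosts > Add New Hosts) and assign it the DataNode role.
docker exec -it datanode1 bash
```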
01-19-2016
10:09 PM
@niraj nagle Are you trying to download the Sandbox, or are you trying to install HDP using a repo file? I am guessing the former. I just tried it using a Chrome browser and it went through. Can you retry?
01-19-2016
07:39 PM
2 Kudos
@Ancil McBarnett I would not put the OS on the SAN. Where would the OS cache be configured? This is usually not done; what are the benefits of putting the OS on SAN? It is an interesting thought, and if you do try it out, please share the results.
01-14-2016
04:51 PM
1 Kudo
@Anshul Sisodia - It looks like you have a connection issue. a) Check on the destination host whether the DataNode is up and running. b) You can run tcpdump between the two hosts and ports and monitor the traffic: https://danielmiessler.com/study/tcpdump/ tcpdump is an excellent tool that can give you a lot of information about network-related problems. A couple of example invocations are below.
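A minimal sketch of the checks meant above; the interface, host name, and port are placeholders (50010 is the default DataNode data-transfer port).

```
# On the destination host: is the DataNode listening on its data-transfer port?
netstat -tlnp | grep 50010

# On either host: capture traffic between the two machines on that port
tcpdump -i eth0 host datanode1.example.com and port 50010
```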