Member since: 09-18-2015
Posts: 100
Kudos Received: 98
Solutions: 11

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1468 | 03-22-2016 02:05 AM
 | 1018 | 03-17-2016 06:16 AM
 | 1821 | 03-17-2016 06:13 AM
 | 1317 | 03-12-2016 04:48 AM
 | 4712 | 03-10-2016 08:04 PM
02-03-2016
07:18 PM
So I had some internal discussion, and the real answer is that dynamic scale-down is hard to achieve. You can scale down using Cloudbreak, but Cloudbreak decommissions the service before it kills the Docker image. So you can technically do it, but as you do, HDFS will try to relocate the replicas, which is time consuming. The alternative is to use something like WASB, where the data is not in the local HDFS store but in WASB. Storage and compute are separate, so you can turn instances down easily.
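For illustration only: with the hadoop-azure connector configured, data sits in Azure Blob Storage rather than on the DataNodes, so the usual filesystem commands work against a wasb:// URI. The account and container names below are placeholders, and the storage key is assumed to be already set in core-site.xml.

```
# Data lives in Azure Blob Storage, not on local DataNode disks,
# so removing compute nodes does not trigger block re-replication.
hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/data/
hadoop fs -cat wasb://mycontainer@myaccount.blob.core.windows.net/data/sample.txt
```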
02-03-2016
06:39 PM
1 Kudo
@khushi kalra The short answer is that it depends on what you are looking for. In the Hortonworks platform we have Apache Atlas and Apache Falcon. Although both tools fall under governance, they have different use cases. For metadata management with HDP you should use Apache Atlas. Version 0.5 is the first release of the product, and it gets much slicker with the upcoming release. Waterline integrates with Atlas. Waterline will give you metadata discovery, but it does not completely integrate with HDP. It runs a MapReduce job that lets you see patterns in the data and determine what kind of data it is. Now, if you have to take that file metadata and use it in conjunction with Hive for any policy work, it will be via Atlas. Atlas is part of the DGI framework. The idea of DGI is to provide a metadata exchange where a community of companies can work on one platform. As Neeraj mentioned, Dataguise is one of them. Collibra, Alation and others are also there. Now the question I have for you is: what are you trying to achieve? Governance is a little bit fuzzy in people's minds. Look at the presentation here: http://hortonworks.com/partners/learn/#dgi I hope this helps.
02-03-2016
05:38 PM
How about using DASH? Cloudbreak suggests DASH with WASB.
02-02-2016
03:02 AM
5 Kudos
A customer wants to use Cloudbreak for deploying Hadoop clusters. They want to scale the Hadoop storage nodes up and down.

a) How does HDFS detect a scale-down of nodes, and will it kick in an HDFS rebalance?

- Cloudbreak instructs Ambari, via a decommission REST API call, to decommission the DataNode and NodeManager.
- Ambari triggers the decommission on the HDP cluster. From a bird's-eye view this is what happens, but in an automated way: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ref-a179736c-eb7c-4dda-b3b4-6f3a778bd8c8.1.html
- The decommission of DataNodes can take a long time if you have a lot of blocks: HDFS needs to replicate the blocks belonging to the decommissioning DataNodes to other live DataNodes to reach the replication factor you specified via dfs.replication in hdfs-site.xml. The default value of this replication factor is 3.
- You can get feedback on the decommission process from the NameNode UI (http://ip_of_namenode:50070/dfshealth.html#tab-datanode) or from command-line tools such as "hdfs fsck /" (a couple of example commands are sketched at the end of this post).
- Cloudbreak periodically polls Ambari for the status of the decommissioning, and Ambari monitors the NameNode. Once the decommissioning is finished, Cloudbreak removes the node from Ambari and deletes the decommissioned VMs from the cloud provider.

b) For a scale-up, would we need to manually kick off an HDFS rebalance?

- Cloudbreak does not trigger an HDFS rebalance.

c) How do you know if you have lost a block? For example, if you scale down 8 of your 10 nodes, how would HDFS handle this case, assuming you have enough storage on the 2 remaining nodes?

- HDFS: If you do not have enough live DataNodes to reach the replication factor, the decommission process hangs until more DataNodes become available (e.g., if you have 10 DataNodes in your cluster with dfs.replication set to 3, you can scale the cluster down to 3 nodes).
- Cloudbreak: If you have 10 DataNodes with a replication factor of 3, Cloudbreak won't even let you remove more than 7 instances, and you get back the error message "Cluster downscale failed.: There is not enough node to downscale. Check the replication factor and the ApplicationMaster occupation."
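As a hedged illustration (these are standard HDFS commands, not Cloudbreak's internal calls), you can watch decommission and replication status from any node with the HDFS client configured:

```
# Summary of live, dead and decommissioning DataNodes and their block counts
hdfs dfsadmin -report

# Check the filesystem for missing, corrupt, or under-replicated blocks
hdfs fsck / -blocks -locations
```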
01-20-2016
11:38 PM
@Mehdi TAZI Having small files in HDFS will create issues, with the NameNode filling up quickly and the blocks being too small. There are a number of ways you can combine the files to create right-sized files. You can also check whether HAR is an option (a sketch follows below). HBase can be an option as well; the key design will be critical. You can also look at OpenTSDB if it is time-series kind of data. Yes, you will have to deal with HBase compaction, node rebuilds, etc.
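A minimal sketch of the HAR option mentioned above; the directory layout and archive name are placeholders.

```
# Pack the contents of /data/small-files into a single archive under /data/archives
hadoop archive -archiveName logs.har -p /data small-files /data/archives

# The archived files remain readable through the har:// scheme
hdfs dfs -ls har:///data/archives/logs.har
```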
01-20-2016
11:22 PM
For non-Java access you would need to set up the Thrift server. The Thrift server runs on port 9090. I hope this helps.
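A minimal sketch, assuming the HBase client packages are installed on the node; in a managed cluster you would typically start the Thrift server through Ambari rather than by hand.

```
# Start the HBase Thrift server on a node with the HBase client installed
hbase thrift start &

# Confirm it is listening on the default port 9090
netstat -tlnp | grep 9090
```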
01-20-2016
10:35 PM
5 Kudos
HDP 2.3 Installation on a Single-Node CentOS
In this step we will start creating the HDP compute cluster. We will create a VM with CentOS 6.7 and deploy HDP 2.3 with Ambari 2.1 on a single node. In the second part you will create a Docker instance and make it a DataNode of the existing instance in the same VM. The tutorial shows how easy it is to use Docker to create a multi-node instance. Details:
Create HDP on a Single Node VM
Create a Docker node and add it as a DataNode to the above VM (a rough sketch follows). Have fun!
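Very roughly, and not the exact steps of the tutorial, adding a Docker container as a DataNode looks something like this; the image tag, container name, and host names are placeholders, and the Ambari repo is assumed to be configured inside the container.

```
# Start a CentOS 6.7 container on the same VM that runs Ambari
docker run -d --privileged --name datanode1 --hostname datanode1.example.com centos:6.7 /sbin/init

# Get a shell in the container; from here you would install ambari-agent,
# point it at the Ambari server, start it, then add the host via the
# Ambari UI (Hosts > Add New Hosts) and assign it the DataNode role.
docker exec -it datanode1 bash
```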
01-19-2016
10:09 PM
@niraj nagle Are you trying to download the Sandbox, or are you trying to install HDP using a repo file? I am guessing the former. I just tried it using a Chrome browser and it went through. Can you retry?
01-19-2016
07:39 PM
2 Kudos
@Ancil McBarnett I would not put the OS on the SAN. Where would the OS cache be configured? This is usually not done; what are the benefits of putting the OS on SAN? It is an interesting thought, and if you do try it out, please share the results.
01-14-2016
04:51 PM
1 Kudo
@Anshul Sisodia - It looks like you have a connection issue. a) Check on the destination host whether the DataNode is up and running. b) You can run tcpdump between the two hosts and ports and monitor the traffic: https://danielmiessler.com/study/tcpdump/ tcpdump is an excellent tool that can give you a lot of information about network-related problems. A couple of example invocations are below.
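A minimal sketch of the checks meant above; the interface, host name, and port are placeholders (50010 is the default DataNode data-transfer port).

```
# On the destination host: is the DataNode listening on its data-transfer port?
netstat -tlnp | grep 50010

# On either host: capture traffic between the two machines on that port
tcpdump -i eth0 host datanode1.example.com and port 50010
```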