Member since 09-18-2015

| Posts | Kudos Received | Solutions |
|---|---|---|
| 100 | 98 | 11 |

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2137 | 03-22-2016 02:05 AM |
| | 1390 | 03-17-2016 06:16 AM |
| | 4975 | 03-17-2016 06:13 AM |
| | 1799 | 03-12-2016 04:48 AM |
| | 5752 | 03-10-2016 08:04 PM |

02-03-2016 07:18 PM

So I had some internal discussion, and the real answer is that dynamic scale-down is hard to achieve. You can scale down using Cloudbreak, but Cloudbreak decommissions the services before it kills the Docker image. So you can technically do it, but as you do, HDFS will try to relocate the replicas, which is time consuming. The alternative is to use something like WASB, where the data is not in the HDFS local store but in WASB. Storage and compute are separate, so you can turn instances down easily.

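A hedged sketch of the WASB approach (the storage account, container, and paths below are placeholders, not from the original thread): once data lives in WASB rather than local HDFS, it is addressed directly with the wasb:// scheme, so worker instances can be removed without waiting for block re-replication.

```
# List data kept in an Azure blob container through the WASB connector
# (assumes fs.azure.account.key.<account>.blob.core.windows.net is set in core-site.xml)
hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/data/

# Copy an existing HDFS dataset into WASB so compute nodes can be scaled down freely
hadoop distcp hdfs:///apps/dataset wasb://mycontainer@myaccount.blob.core.windows.net/apps/dataset
```
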
02-03-2016 06:39 PM
1 Kudo

@khushi kalra The short answer is that it depends on what you are looking for. In the Hortonworks platform we have Apache Atlas and Apache Falcon. The two tools, though both under governance, have different use cases. For metadata management with HDP you should use Apache Atlas. Version 0.5 is the first release of the product, and it gets much slicker with the upcoming release. Waterline integrates with Atlas. Waterline will give you metadata discovery, but it does not completely integrate with HDP. It runs a MapReduce job, which will allow you to see patterns in the data and say what kind of data it is. If you then have to take that file metadata and use it in conjunction with Hive for any policy work, it will be via Atlas. Atlas is part of the DGI framework. The idea of DGI is to provide a metadata exchange where a community of companies can work on one platform. As Neeraj mentioned, Dataguise is one of them. Collibra, Alation, and others are also there. Now the question I have for you is: what are you trying to achieve? Governance is a little bit fuzzy in people's minds. Look at the presentation here: http://hortonworks.com/partners/learn/#dgi  I hope this helps.

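As a loose illustration only (the REST paths below reflect the early Atlas v1 API and the default port 21000; the host, credentials, and endpoints are assumptions and may differ in your Atlas version), metadata registered in Atlas can be queried over HTTP:

```
# List the types registered in Atlas (default admin credentials are an assumption)
curl -u admin:admin "http://<atlas-host>:21000/api/atlas/types"

# List entities of a given type, e.g. Hive tables captured by the Hive hook
curl -u admin:admin "http://<atlas-host>:21000/api/atlas/entities?type=hive_table"
```
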
02-03-2016 05:38 PM

How about using DASH? Cloudbreak suggests DASH with WASB.

02-02-2016 03:02 AM
5 Kudos

A customer wants to use Cloudbreak for deploying Hadoop clusters. They want to scale the Hadoop storage nodes up and down.

a) How does HDFS detect scale-down of nodes, and will it kick in an HDFS rebalance?
- Cloudbreak instructs Ambari, via the decommission REST API call, to decommission a DataNode and NodeManager.
- Ambari triggers the decommission on the HDP cluster. From a bird's-eye perspective this is what happens: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ref-a179736c-eb7c-4dda-b3b4-6f3a778bd8c8.1.html but in an automated way.
- The decommission of DataNodes can take a long time if you have a lot of blocks: HDFS needs to replicate the blocks belonging to the decommissioning DataNodes to other live DataNodes to reach the replication factor you specified via dfs.replication in hdfs-site.xml. The default replication factor is 3.
- You can get feedback from the decommission process, e.g., from the NameNode UI at http://<ip_of_namenode>:50070/dfshealth.html#tab-datanode, or you can use command-line tools like "hdfs fsck /" (see the sketch after this answer).
- Cloudbreak periodically polls Ambari about the status of the decommissioning, and Ambari monitors the NameNode.
- If the decommissioning is finished, Cloudbreak removes the node from Ambari and deletes the decommissioned VMs from the cloud provider.

b) For scale-up, would we need to manually kick off an HDFS rebalance?
- Cloudbreak does not trigger an HDFS rebalance.

c) How do you know if you have lost a block? For example, if you scale down 8 out of your 10 nodes, how would HDFS handle this case, assuming you have enough storage on the 2 remaining nodes?
- HDFS: If you do not have enough live DataNodes to reach the replication factor, the decommission process will hang until more DataNodes become available (e.g., if you have 10 DataNodes in your cluster with dfs.replication set to 3, then you are able to scale your cluster down to 3 nodes).
- Cloudbreak: If you have 10 DataNodes with a replication factor of 3, Cloudbreak won't even let you remove more than 7 instances, and you get back a "Cluster downscale failed.: There is not enough node to downscale. Check the replication factor and the ApplicationMaster occupation." error message.

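A hedged, command-line sketch of how to watch the decommission from the HDFS side (these are stock HDFS tools, not Cloudbreak-specific commands; run them on a cluster node as the hdfs user):

```
# Replication factor the cluster is configured with (dfs.replication, default 3)
hdfs getconf -confKey dfs.replication

# Block-level health while the decommissioning DataNodes drain:
# reports under-replicated and missing blocks
hdfs fsck /

# Per-DataNode view, including nodes in "Decommission In Progress" state
hdfs dfsadmin -report
```
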
01-20-2016 11:38 PM

@Mehdi TAZI Having small files in HDFS will create issues, with the NameNode filling up quickly and the blocks being too small. There are a number of ways you can combine the files to create right-sized files. You can also try and see if HAR is an option (see the sketch below). HBase can be an option too; the key design will be critical. You can also look at OpenTSDB if it is time-series kind of data. Yes, you will have to deal with HBase compaction, node rebuilds, etc.

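A hedged illustration of the HAR option (the directory names are placeholders): small files are packed into a Hadoop Archive and remain readable through the har:// scheme.

```
# Pack everything under /data/small-files into logs.har stored in /data/archives
hadoop archive -archiveName logs.har -p /data/small-files /data/archives

# The archived files stay accessible to MapReduce/Hive via the har:// scheme
hadoop fs -ls har:///data/archives/logs.har
```
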
01-20-2016 11:22 PM

For non-Java access you would need to set up the Thrift server. The Thrift server runs on port 9090. I hope this helps.

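A minimal sketch, assuming the service in question is HBase (which uses port 9090 for Thrift by default): start the Thrift server on a gateway node and point non-Java clients at that port.

```
# Start the HBase Thrift server (listens on port 9090 by default)
hbase-daemon.sh start thrift

# Quick check that it is listening
netstat -tln | grep 9090
```
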
01-20-2016 10:35 PM
5 Kudos

HDP 2.3 Installation on a Single-Node CentOS VM

In this step we will start creating the HDP compute cluster. We will create a VM with CentOS 6.7 and deploy HDP 2.3 with Ambari 2.1 on a single node. In the second part you will create a Docker instance and make it a DataNode for the existing instance in the same VM. The tutorial shows how easy it is to use Docker to create a multi-node instance.

Details:
- Create HDP on a single-node VM
- Create a Docker node and add it as a DataNode to the above VM (a sketch follows below)

Have fun!

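A hedged sketch of the Docker DataNode step (the image tag, repo setup, and Ambari server address are assumptions, not taken from the tutorial): start a CentOS container, install the Ambari agent inside it, point the agent at the Ambari server on the VM, and then add the new host as a DataNode from the Ambari UI.

```
# Launch a CentOS 6.7 container that will act as the extra DataNode
docker run -d --privileged --name datanode1 --hostname datanode1 centos:6.7 /sbin/init

# Inside the container (assumes an ambari.repo for Ambari 2.1 is already configured):
docker exec -it datanode1 bash
yum install -y ambari-agent
# Point the agent at the Ambari server running on the VM (placeholder hostname)
sed -i 's/hostname=localhost/hostname=<ambari-server-host>/' /etc/ambari-agent/conf/ambari-agent.ini
ambari-agent start
```
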
01-19-2016 10:09 PM

@niraj nagle Are you trying to download the Sandbox, or are you trying to install HDP using a repo file? I guess the former. I just tried it using the Chrome browser and it went through. Can you retry?

01-19-2016 07:39 PM
2 Kudos

@Ancil McBarnett I would not put the OS on the SAN. Where would the OS cache be configured? This is usually not done; what are the benefits of putting the OS on a SAN? It is an interesting thought, and if you do try it out, do share the results.

01-14-2016 04:51 PM
1 Kudo

@Anshul Sisodia - It looks like you have a connection issue.

a) Check on the destination host whether the DataNode is up and running.
b) You can run tcpdump between the two hosts and the port and monitor the traffic (see the sketch below): https://danielmiessler.com/study/tcpdump/

tcpdump is an excellent tool that can give you a lot of information about network-related problems.

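A hedged example of the tcpdump step (the interface and destination host are placeholders; port 50010 is the default DataNode data-transfer port in HDP 2.x):

```
# Watch traffic between this host and the destination DataNode on the data-transfer port
tcpdump -i eth0 -nn host <destination-host> and port 50010
```
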