Member since 02-29-2016

Posts: 23 | Kudos Received: 6 | Solutions: 0

04-17-2017 11:39 PM

Basic, but have you tried restarting the history server?

./sbin/stop-history-server.sh
./sbin/start-history-server.sh

03-14-2017 05:47 AM

@Sachin Ambardekar, the doc above may be slightly dated. As a rule of thumb, 4 GB per core seems to be the sweet spot for the memory-intensive workloads that are becoming more common nowadays.
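
For illustration only, a rough sizing sketch based on that 4 GB-per-core rule of thumb; the node and executor sizes below are made-up placeholders, not numbers from this thread:

# Rough sizing arithmetic using the 4 GB-per-core rule of thumb.
# All node and executor sizes here are hypothetical placeholders.
GB_PER_CORE = 4

cores_per_node = 16                                   # hypothetical worker node
container_memory_gb = cores_per_node * GB_PER_CORE    # ~64 GB for YARN containers

executor_cores = 4                                    # a common Spark executor size
executor_memory_gb = executor_cores * GB_PER_CORE     # i.e. --executor-memory 16g

print(container_memory_gb, executor_memory_gb)        # 64 16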
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
03-14-2017 05:31 AM

Thanks, man. It wasn't clear whether CB 1.6 could support ADLS. A better option than WASB shards with DASH, for sure.

03-10-2017 07:51 PM

Quick update: DASH is a package available from MSFT that allows "sharding" across multiple accounts: https://github.com/MicrosoftDX/Dash/tree/master/DashServer

03-10-2017 06:56 PM (1 Kudo)

Hi,

I have a use case where an HDP cluster on Azure is used for dev and test. Ideally, we would like to separate the dev and test data into two different WASB storage accounts. Is there a way to define multiple accounts and keys in core-site.xml? And how would that map onto the file system? Would it simply be wasb://mybucket[1-2]?

Thanks!

Labels: Hortonworks Data Platform (HDP)
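
For what it's worth, a minimal PySpark sketch of how two WASB accounts could be wired up side by side; the account names, container names, and keys below are placeholders, and the same fs.azure.account.key.* properties could equally be declared in core-site.xml:

from pyspark.sql import SparkSession

# Hypothetical storage accounts "devstore" and "teststore"; replace names and keys.
# Properties prefixed with spark.hadoop. are passed through to the Hadoop
# configuration, so these mirror entries that would otherwise go in core-site.xml.
spark = (SparkSession.builder
         .appName("MultiWasbAccounts")
         .config("spark.hadoop.fs.azure.account.key.devstore.blob.core.windows.net", "<dev-account-key>")
         .config("spark.hadoop.fs.azure.account.key.teststore.blob.core.windows.net", "<test-account-key>")
         .getOrCreate())

# Each account is addressed through its fully qualified URI,
# wasb://<container>@<account>.blob.core.windows.net/<path>,
# rather than an indexed name like wasb://mybucket[1-2].
dev_lines = spark.read.text("wasb://data@devstore.blob.core.windows.net/logs/")
test_lines = spark.read.text("wasb://data@teststore.blob.core.windows.net/logs/")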
			
    
	
		
		
02-09-2017 02:44 AM (1 Kudo)

I had a similar use case recently. You have to approach this with the understanding that it's a different paradigm:

You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

# Read the log file from HDFS; each record becomes a plain string (the line)
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])

You don't programmatically iterate over the data per se; instead you supply a function that processes each value (each line in this case). So the code where you iterate over lines could be put inside a function:

def virtualPortFunction(line):
    # Do something with a single line and return the processed result
    return line

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
                             .map(lambda x: virtualPortFunction(x))

This is a very simplistic way to put it, but it should give you a starting point if you decide to go down the PySpark route. Also look at the PySpark examples that ship with the Spark distribution; they're a good place to start:

https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
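
A small follow-up note on the snippet above: RDD transformations are lazy, so nothing actually runs until an action is called. For example (the output path is a placeholder):

# Pull a few processed values back to the driver to eyeball the output
for value in virtualPortsSomething.take(10):
    print(value)

# Or write the processed RDD back to HDFS (placeholder path)
virtualPortsSomething.saveAsTextFile("hdfs://[servername]/[path]/Virtual_Ports_out")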
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
02-09-2017 01:59 AM

Assuming you're using the HDP 2.5 sandbox, another option would be to deploy the Zeppelin service in Ambari. The modules above are also included in Zeppelin.

10-24-2016 10:03 PM

Good point, Tim. Each "SQL on Hadoop" implementation obviously has pros and cons... general rules of thumb:

SparkSQL --> good for iterative processing and accessing existing Hive tables, provided the results fit in memory
HAWQ --> good for "traditional" BI-like queries, star schemas, OLAP cubes
Hive LLAP --> good for petabyte scale mixed with smaller tables requiring sub-second queries
Phoenix --> a good way to interact with HBase tables; good with time series and indexing
Drill, Presto --> query-federation-like capabilities but limited SQL syntax; performance varies quite a bit

09-07-2016 06:13 PM

Great article, cduby. Thanks!