Member since 09-26-2015

135 Posts | 85 Kudos Received | 26 Solutions

About
Steve's a hadoop committer mostly working on cloud integration
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 3457 | 02-27-2018 04:47 PM |
| | 5928 | 03-03-2017 10:04 PM |
| | 3554 | 02-16-2017 10:18 AM |
| | 1883 | 01-20-2017 02:15 PM |
| | 11882 | 01-20-2017 02:02 PM |
12-14-2015 07:30 PM

Its actual title is "Hadoop and Kerberos: The Madness Beyond the Gate" —there's an HP Lovecraft theme of "forbidden knowledge which will drive you insane", which is less a joke and more commentary. It's actually rendered on GitBook. If you are working with Kerberos, get a copy of the O'Reilly Hadoop Security book too; my little e-book was written to cover the bits that were left out: to extend rather than replace. Finally, being open source: contributions are welcome.
12-14-2015 07:26 PM

Thank you. View it as working notes to avoid me having to send emails to colleagues trying to understand things. And being working notes, it only covers the problems I've encountered. There are many more out there, and in fact I am having serious problems with Kerberos right now which have even me defeated. So don't expect it to solve all your problems.
12-13-2015 03:41 PM

I must disagree. Dedicating machines via labels is not always the right choice. Imagine you give 20 nodes in a 100-node cluster the label "spark", with only spark-queue work able to run on them. When there's no work on that queue, the machines are idle. When there is work in the queue, it'll only get run on those 20 nodes. There's also replication & data locality to consider: if the data you need isn't on one of those 20 nodes, it'll be remote —which can also hurt performance. You really need to look at the cluster and workload to make a good choice.
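For context, this is roughly how nodes get a label attached in the first place, via the ResourceManager admin CLI; the label and host names are placeholders, and the exact syntax varies a little across Hadoop 2.x releases:

    # declare the label to the ResourceManager, then tag individual hosts with it
    yarn rmadmin -addToClusterNodeLabels "spark"
    yarn rmadmin -replaceLabelsOnNode "worker01=spark worker02=spark"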
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
12-13-2015 03:35 PM
1 Kudo

If you are running Spark applications on a YARN cluster then you do not need to directly allocate memory or machines to it.

You can dedicate machines via labels, either for exclusive workloads or to handle heterogeneous hardware better. If there is some application where latency and the ability to respond immediately to spikes in load matter, then dedicated labels work; for example, HBase in interactive applications. If different parts of the cluster have different hardware configurations (for example RAM, GPU, or SSD for local storage), then labels help you schedule jobs which need those features onto only those machines. Once you start using labels, the labelled hosts will be underutilized when that specific work isn't running: that's the permanent tradeoff.

If you are just running queries on a cluster, where latency isn't so critical that you want to pre-allocate capacity on isolated machines, then using queues is more efficient. You can also set up queue priorities and pre-emption, so your important Spark queries can actually pre-empt (i.e. kill) ongoing work from lower-priority applications.

What is important for Spark is having your jobs ask for the memory they really need: Spark likes a lot, and if the Spark JVM/Python code consumes more than was allocated to it in the YARN container requests, the processes may get killed.
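As a rough illustration of both points (naming a queue, and asking YARN for the memory the executors actually use, heap plus overhead), a submission could look something like the following; the queue name, sizes, class and jar are placeholders to adapt to your own cluster:

    # request executor heap plus extra YARN headroom for off-heap/Python memory
    spark-submit --master yarn-cluster \
      --queue analytics \
      --num-executors 8 \
      --executor-memory 4g \
      --conf spark.yarn.executor.memoryOverhead=768 \
      --class com.example.MyJob myjob.jar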
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
12-12-2015 02:09 PM
2 Kudos

There isn't really much in the way of Ceph integration. There is a published filesystem client JAR which, if you get it on your classpath, should let you refer to data using ceph:// as the path. You also appear to need its native lib on the path, which is a bit trickier. This comes from the Ceph team, not the Hadoop people, and

1. I don't know how up to date/in sync it is with recent Hadoop versions.
2. It doesn't get released or tested by the Hadoop team: we don't know how well it works, or how it goes wrong.

Filesystems are an interesting topic in Hadoop. It's a core, critical part of the system: you don't want to lose data. And while there's lots of support for different filesystem implementations in Hadoop (s3n, avs, ftp, swift, file:), HDFS is the one things are built and tested against. Object stores (s3, swift) are not real filesystems, and cannot be used in place of HDFS as the direct output of MR, Tez or Spark jobs; and absolutely never to run HBase or Accumulo atop.

I don't know where Ceph fits in here. It's probably safe to use it as a source of data; it's as the destination where the differences usually show up.

Finally: HDP is not tested on Ceph, so cannot be supported. We do test on HDFS, against Azure storage (in HDInsight), and on other filesystems (e.g. Isilon). I don't know of anyone else who tests Hadoop on Ceph the way, say, Red Hat do with GlusterFS.
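For anyone who does want to experiment, the hookup is the standard Hadoop core-site.xml pattern for a third-party filesystem. The property names below are from my recollection of the Ceph project's own docs, so treat them as an assumption and check their documentation for the versions you have:

    <!-- assumption: property names as published by the Ceph project, not by Hadoop -->
    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>
    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>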
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
12-12-2015 02:01 PM
4 Kudos

This looks like it's being triggered by the Spark -> timeline server integration, as ATS is going OOM when handling Spark events. Which means it's my code running in the Spark jobs triggering this. What kind of jobs are you running? Short-lived? Long-lived? Many executors?

The best short-term fix is for you to disable the timeline server integration, and set the Spark applications up to log to HDFS instead, with the history server reading it from there. The details of this are covered in Spark Monitoring.

1. In the Spark job configuration you need to disable the ATS publishing. Find the line

    spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService

and delete it. Then set the property spark.history.fs.logDirectory to an HDFS directory which must be writeable by everyone, for example hdfs://shared/logfiles:

    spark.eventLog.enabled true
    spark.eventLog.compress true
    spark.history.fs.logDirectory hdfs://shared/logfiles

2. In the history server you need to switch to the filesystem log provider:

    spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
    spark.history.fs.logDirectory hdfs://shared/logfiles

The next Spark release we'll have up for download (soon!) will log fewer events to the timeline server, which should hopefully reduce the problems on the timeline server. There's also lots of work going on in the timeline server for future Hadoop versions to handle larger amounts of data —by mixing stuff kept in HDFS with the leveldb data. For now, switching to the filesystem provider is your best bet.
12-09-2015 07:17 PM

A key one is straightforward: HDFS is where the data is, and YARN schedules work by that data. YARN clusters are very widely deployed, and Spark on YARN lets you run Spark queries against such a cluster without you even needing to ask permission from the cluster ops team. To them, it's just another client job.
12-09-2015 07:14 PM

Note that Spark 1.5+ is needed for Spark jobs of duration > 72h not to fail when their Kerberos tickets expire. And you'll need to supply a keytab which the Spark AM can renew tickets with. For short-lived queries, this problem should not surface.
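For reference, the keytab and principal go in at submission time; something like this, where the principal, keytab path, class and jar are placeholders:

    # Spark 1.5+: the AM uses the keytab to re-acquire tickets for long-running jobs
    spark-submit --master yarn-cluster \
      --principal etl-user@EXAMPLE.COM \
      --keytab /etc/security/keytabs/etl-user.keytab \
      --class com.example.LongLivedJob job.jar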
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
12-09-2015 01:42 PM
1 Kudo

Note also you are going to get less IO bandwidth, as you move from 3 replicas (and hence 3 places to run code locally) to what is essentially a single replica, with the data spread across the network. Erasure coding is best for storing cold data, where the improvement in storage density is tangible: it will hurt performance through

- loss of locality (network layer)
- loss of replicas (disk IO layer)
- need to rebuild the raw data (CPU overhead)

I don't think we have any figures yet on the impact. On a brighter note, 10GbE ToR switches are falling in price, so you could think about going to 10 Gb on-rack, even if the backbone remains a bottleneck.
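To put a rough number on the storage-density side, taking the common Reed-Solomon 6 data + 3 parity layout purely as an illustrative assumption:

    3x replication : 1 TB of data -> 3.0 TB of raw disk, 3 nodes each holding a full local copy
    RS(6,3) coding : 1 TB of data -> 1.5 TB of raw disk, no node holding a full local copy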
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
11-30-2015 06:14 PM

Primarily so that Ambari can use it to deploy and manage things via Slider; it doesn't need to be installed on other machines in the cluster.