Member since 09-26-2015
      
135 Posts
85 Kudos Received
26 Solutions

About
Steve's a Hadoop committer, mostly working on cloud integration.
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3466 | 02-27-2018 04:47 PM |
|  | 5931 | 03-03-2017 10:04 PM |
|  | 3555 | 02-16-2017 10:18 AM |
|  | 1885 | 01-20-2017 02:15 PM |
|  | 11904 | 01-20-2017 02:02 PM |

12-24-2015 10:56 AM
Hue has something behind the scenes called Livy, which is a little REST server doing the work...they haven't teased that out and made it standalone, which is a shame. There's actually something very interesting starting in the Apache incubator, IBM's Spark Kernel code (which will be renamed during the incubation process)...this lets you wire up Jupyter directly, but also offers the ability to upload code callbacks into the Spark cluster itself. I think that's pretty nice, and I'll be keeping an eye on it, though I don't know when it will be ready for broad use.
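To give a feel for what a standalone REST job server buys you, here's a rough sketch against Livy-style endpoints (POST /sessions to start an interpreter, POST /sessions/<id>/statements to run code). The host, port and payloads are made up for illustration and the API may well change as things mature:

```
# create an interactive pyspark session (8998 is the usual Livy port; adjust to taste)
curl -X POST -H "Content-Type: application/json" \
     -d '{"kind": "pyspark"}' \
     http://livy-host:8998/sessions

# once that session (say, id 0) is idle, push a statement at it and poll for the result
curl -X POST -H "Content-Type: application/json" \
     -d '{"code": "sc.parallelize(range(100)).count()"}' \
     http://livy-host:8998/sessions/0/statements
```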
						
					

12-24-2015 10:52 AM
1 Kudo
that doc is a bit confusing: I read it myself and wasn't too sure. I've filed a JIRA on reviewing and updating it. Bearing in mind the Python agent-side code is not something I know my way around, I think that comment about hostname:port is actually describing how site configurations can be built up. I believe that Python installation code running in a container can actually push out any quicklink values it wants. Client apps do have to be aware that (a) that data isn't there until the container is up and running, and (b) after failover the outdated entries will hang around until replaced.
						
					

12-23-2015 12:35 PM
I thought that on a secure cluster Zeppelin can only make queries as the user hosting the web UI...though I'm not sure there. Spark SQL doesn't do user authentication in general, not via the thrift server (JDBC and especially ODBC). Nor does it do column-level access control as Hive does. It's just going straight at the files themselves. So it's not that locked down.
						
					

12-19-2015 05:18 PM
1 Kudo
if it's networking that's the problem, just download the JAR file yourself and use the --jars option to add it to the classpath. Looks like it lives under https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/
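Roughly like this; the artifact name follows the usual Maven layout under that URL, and the local path is just an example (note spark-csv has a couple of dependencies of its own, e.g. commons-csv, which you may need to grab the same way):

```
# pull the artifact straight off Maven central
wget https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.jar

# put it on the driver and executor classpaths at launch time
spark-shell --jars /path/to/spark-csv_2.10-1.1.0.jar
```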
						
					

12-18-2015 08:28 PM
1 Kudo
it won't; Java doesn't look at the OS proxy settings. (There are a couple of exceptions, but they don't usually surface in a world where applets are disabled.)
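You have to hand the proxy to the JVM yourself, via the standard http.proxyHost/http.proxyPort system properties. Something like the line below, where the proxy address and jar name are obviously placeholders and the exact launcher (plain java, HADOOP_OPTS, spark-submit driver options) depends on what you're running:

```
# Java only picks up proxy settings passed in as system properties
java -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
     -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080 \
     -jar my-app.jar
```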
						
					

12-18-2015 08:27 PM
3 Kudos
if you use the s3a:// client, then you can set the fs.s3a.proxy.* settings (host, port, username, password, domain, workstation) to get through. See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
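In core-site.xml that looks something like the snippet below; the host and port are placeholders, and the username/password (and NTLM domain/workstation) entries are only needed if the proxy demands authentication:

```
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>
<!-- only needed if the proxy requires authentication -->
<property>
  <name>fs.s3a.proxy.username</name>
  <value>alice</value>
</property>
<property>
  <name>fs.s3a.proxy.password</name>
  <value>secret</value>
</property>
```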
						
					

12-17-2015 12:05 PM
I'd put that down to DNS being in a mess, or you not having a principal of the form service/host@REALM for the host in question. See: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
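A quick sanity check on the principal side is to list what's actually in the service keytab and try to log in with it; the keytab path, principal and realm below are just examples of a typical layout:

```
# list the principals actually present in the keytab
klist -kt /etc/security/keytabs/nn.service.keytab

# try to authenticate as the service/host principal for this machine
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
```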
						
					

12-17-2015 11:53 AM
I find my ZK logs end up under /var/log/zookeeper, at least with the HDP installations. Make sure that the log directory has the permissions to be written to by the ZK account; if it doesn't, you won't see logs.
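Something like this for checking; the zookeeper user and group names are whatever your installation actually runs the service as:

```
# see who owns the log directory and whether it's writable by the ZK account
ls -ld /var/log/zookeeper

# if not, hand it back to the account running the ZooKeeper service
chown -R zookeeper:hadoop /var/log/zookeeper
```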
						
					

12-17-2015 11:49 AM
9 Kudos
The hdfs fsck operation doesn't check blocks for corruption; that takes too long. It looks at the directory structures alone.

Blocks are checked for corruption whenever they are read; there are little CRC checksum files created for parts of a block which are validated on read() operations. If you work with the file:// filesystem you can see these same files in your local FS. If a block is found to be corrupt on a read, the DFS client will report this to the namenode and ask for another replica, which will be used instead. As Chris said, the namenode then schedules the uncorrupted block for re-replication, as if it were under-replicated. The corrupted block doesn't get deleted until that replication succeeds. Why not? If all replicas are corrupt, then maybe you can salvage something from all the corrupt copies of the block.

Datanodes also scan all their blocks in the background; they just do it fairly slowly by default so that applications don't suffer. The scan ensures that corrupted blocks are usually found before programs read them, and that problems with "cold" data are found at all. It's designed to avoid the problem of all replicas getting corrupted and you not noticing until it's too late to fix. Look in the HDFS XML description for the details on the two options you need to adjust: dfs.datanode.scan.period.hours and dfs.block.scanner.volume.bytes.per.second.
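For example, in hdfs-site.xml on the datanodes; the values here are purely illustrative (the defaults are a three-week scan period and a deliberately throttled scan rate):

```
<!-- rescan every block at least once a week -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>168</value>
</property>

<!-- let the scanner read up to 4 MB/s per volume -->
<property>
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <value>4194304</value>
</property>
```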
How disks fail/data gets corrupted is a fascinating problem. Here are some links if you really want to learn more about it:

- Did you really want that data (an old presentation of mine)
- Failure Trends in a Large Disk Drive Population (Google)
- A Large-Scale Study of Flash Memory Failures in the Field (a recent Facebook paper on flash failures; shows they are less common than you'd fear)

I'd also recommend you look at some of the work on memory corruption; that's enough to make you think that modern laptops and desktops should be using ECC RAM.
						
					

12-14-2015 07:34 PM
3 Kudos
the .hwx version is one which has a security fix in it; nothing else changed. It's not published to the Maven central repo, so it's not easy to pick up. We do have a repo which has it, but I think it's some internal one whose URLs don't resolve. You can build with the normal one just by passing -Djetty.version=6.1.26
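i.e. something along these lines when you kick the build off; the exact goals and profile flags depend on which part of the stack you're building:

```
# build against the stock Jetty release rather than the .hwx one
mvn clean install -DskipTests -Djetty.version=6.1.26
```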
						
					