Member since 09-06-2016

Posts: 108
Kudos Received: 36
Solutions: 11
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3118 | 05-11-2017 07:41 PM |
| | 1601 | 05-06-2017 07:36 AM |
| | 8681 | 05-05-2017 07:00 PM |
| | 2865 | 05-05-2017 06:52 PM |
| | 7739 | 05-02-2017 03:56 PM |

08-21-2018 01:20 PM

Can confirm the DBCPConnectionPool approach suggested here by @Rudolf Schimmel works. We did run into issues when using Java 10 (uncaught Exception: java.lang.NoClassDefFoundError: org/apache/thrift/TException even though libthrift was specified). Using Java 8 worked.
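For anyone setting this up, here is a minimal sketch of what such a DBCPConnectionPool controller service might look like for HiveServer2 (not from the original post; the connection URL, hostname, port, and jar paths are placeholder assumptions):

Database Connection URL     : jdbc:hive2://hiveserver2.example.com:10000/default
Database Driver Class Name  : org.apache.hive.jdbc.HiveDriver
Database Driver Location(s) : /opt/jdbc/hive-jdbc-standalone.jar,/opt/jdbc/libthrift.jar
Database User               : nifi
Password                    : (set in the controller service)

As noted above, the JVM running NiFi itself should be Java 8; with Java 10 the libthrift classes were not found even though the jar was listed.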
						
					

05-19-2017 01:49 PM

Ingest data with NIFI to Hive LLAP and Druid
- Setting up Hive LLAP
- Setting up Druid
- Configuring the dimensions
- Setting up Superset
- Connection to Druid
- Creating a dashboard with visualisations
- Differences with ES/Kibana and SOLR/Banana
						
					

05-13-2017 02:10 PM (1 Kudo)

Hi @Sushant, you can control access to YARN queues, including who can kill applications, with access control lists (ACLs). Read more about this in the docs. /W
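As a rough illustration (not from the original answer), Capacity Scheduler queue ACLs live in capacity-scheduler.xml; the queue name and user names below are made up:

<property>
  <name>yarn.scheduler.capacity.root.analytics.acl_submit_applications</name>
  <value>alice,bob</value>
</property>
<property>
  <!-- administrators of a queue can also kill any application running in it -->
  <name>yarn.scheduler.capacity.root.analytics.acl_administer_queue</name>
  <value>ops-admin</value>
</property>

Keep in mind that these ACLs are inherited down the queue hierarchy, and the root queue defaults to "*", so the root ACLs usually need to be restricted as well.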
						
					

05-12-2017 06:45 AM

Hmm, can't find anything obvious. Best to post a new question on HCC for this, so it gets the proper attention.
						
					

05-11-2017 07:41 PM (1 Kudo)

Hi @PJ, see https://issues.apache.org/jira/browse/HDFS-4239 for a good, relevant discussion. So: shut down the datanode, clean the disk, remount, and restart the datanode. Because HDFS replicates the data with a replication factor of 3, that shouldn't be a problem. Make sure the new mount is in the dfs.data.dir config. Additionally, you can also decommission and recommission the node following the steps here: https://community.hortonworks.com/articles/3131/replacing-disk-on-datanode-hosts.html
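Sketched as shell steps (an outline under assumptions, not from the original answer: the failed disk is mounted at /grid/2, the new device is /dev/sdX1, and the DataNode is managed manually; with Ambari you would stop and start it from the UI instead):

# stop the DataNode on the affected host
su - hdfs -c "hadoop-daemon.sh stop datanode"

# replace/clean the disk, recreate the filesystem, and remount it
umount /grid/2
mkfs.ext4 /dev/sdX1      # hypothetical device name for the new disk
mount /grid/2            # assumes an fstab entry exists for /grid/2

# check that /grid/2 is still listed in dfs.data.dir, then start the DataNode again
su - hdfs -c "hadoop-daemon.sh start datanode"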
						
					

05-10-2017 08:01 PM (3 Kudos)

The standard solution

Let's say you want to collect log messages from an edge cluster with NIFI and push them to a central NIFI cluster via the Site-to-Site (S2S) protocol. This is exactly what NIFI is designed for, and it results in a simple flow setup like this:

- A processor tails the log file and sends its flowfiles to a remote process group, which is configured with the FQDN URL of the central NIFI cluster.
- On the central NIFI cluster an input port is defined, and from that input port the rest of the flow does its thing with the incoming flowfiles: filtering, transformations, and eventually sinking into Kafka, HDFS or SOLR.
- The NIFI S2S protocol is used for the connection between the edge NIFI cluster and the central NIFI cluster, and it PUSHES the flowfiles from the edge cluster to the central NIFI cluster.

And now with a firewall blocking incoming connections in between

This standard setup assumes the central NIFI cluster has a public FQDN and isn't behind a firewall blocking incoming connections. But what if there is such a firewall? Fear not! The flexibility of NIFI comes to the rescue once again. The solution is to move the initiation of the S2S connection from the edge NIFI to the central NIFI:

- The remote process group is defined on the central node and connects to an output port on the edge node. The edge NIFI node must have a public FQDN (this is required!).
- Instead of an S2S PUSH, the data is effectively PULLED from the edge NIFI cluster to the central NIFI cluster.

To be clear: this setup has the downside that the central NIFI cluster needs to know about all edge clusters. Not necessarily a big deal; it just means the flow in the central NIFI cluster needs to be updated when edge clusters/nodes are added. But if you can't change the fact that a firewall blocks incoming connections, it does the job.

Example solution NIFI flow setup

Screenshot of the flow on the edge node, with a TailFile processor that sends its flowfiles to the output port named `logs`.

Screenshot of the flow on the central NIFI cluster, with a remote process group pointed at the FQDN of the edge node and a connection from the output port `logs` to the rest of the flow.

Screenshot of the configuration of the remote process group.

Screenshot of the details of the `logs` connection.
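For reference, the pull setup only works if the edge node's site-to-site input is reachable from the central cluster. A minimal sketch of the relevant nifi.properties on the edge node (hostname and ports are illustrative assumptions):

# nifi.properties on the edge node
nifi.remote.input.host=edge-node-01.example.com
nifi.remote.input.secure=false
nifi.remote.input.socket.port=10443
nifi.remote.input.http.enabled=true

The remote process group on the central cluster then points at the edge node's NiFi URL (for example http://edge-node-01.example.com:8080/nifi) and pulls from the `logs` output port.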
						
					

05-09-2017 06:52 PM

Great that you were able to solve it!
						
					

05-09-2017 09:22 AM (1 Kudo)

To get an idea of the write performance of a Spark cluster, I've created a Spark version of the standard TestDFSIO tool, which measures the I/O performance of HDFS in your cluster. Lies, damn lies and benchmarks: the goal of this tool is to provide a sanity check of your Spark setup, focusing on HDFS write performance, not on compute performance. Think the tool can be improved? Feel free to submit a pull request or raise a GitHub issue.

Getting the Spark Jar

Download the Spark jar from here: https://github.com/wardbekker/benchmark/releases/download/v0.1/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar

It's built for Spark 1.6.2 / Scala 2.10.5.

Or build it from source:

$ git clone https://github.com/wardbekker/benchmark
$ cd benchmark && mvn clean package

Submit args explained

<files/partitions> : should ideally be equal to the recommended spark.default.parallelism (cores x instances).
<bytes_per_file> : should fit in memory, for example 90000000.
<write_repetitions> : number of times the test RDD is re-written to disk; the benchmark is averaged over these repetitions.

spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors X --executor-cores Y --executor-memory Z target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar <files/partitions> <bytes_per_file> <write_repetitions>

CLI example for 12 workers with 30 GB of memory per node

It's important to get the number of executors and cores right: you want the maximum amount of parallelism without going over the maximum capacity of the cluster. This command will write out the generated RDD 10 times and calculate an aggregate throughput over those runs.

spark-submit --class org.ward.Benchmark --master yarn --deploy-mode cluster --num-executors 60 --executor-cores 3 --executor-memory 4G target/benchmark-1.0-SNAPSHOT-jar-with-dependencies.jar 180 90000000 10

Retrieving benchmark results

You can retrieve the benchmark results by running yarn logs like this:

yarn logs -applicationId <application_id> | grep 'Benchmark'

For example:

Benchmark: Total volume         : 81000000000 Bytes
Benchmark: Total write time     : 74.979 s
Benchmark: Aggregate Throughput : 1.08030246E9 Bytes per second

So that's about 1 GB written per second for this run.
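That last number is simply total volume divided by total write time; if you want to reproduce it from the two log lines above, a one-liner like this works:

awk 'BEGIN { printf "%.4e Bytes per second\n", 81000000000 / 74.979 }'   # prints roughly 1.0803e+09, i.e. about 1 GB/s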
						
					

05-08-2017 02:59 PM

@Arpit Agarwal good point. The customer uses Ranger audit logging. What extra information is in the HDFS audit log that is not already in the Ranger audit logs?
						
					