Member since 01-15-2016

82 Posts
29 Kudos Received
10 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 8615 | 04-03-2017 09:35 PM |
|  | 5431 | 12-29-2016 02:22 PM |
|  | 1729 | 06-27-2016 11:18 AM |
|  | 1318 | 06-21-2016 10:08 AM |
|  | 1476 | 05-26-2016 01:43 PM |
			
    
	
		
		

04-05-2017 08:29 PM

@tuxnet It should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file on your cluster. If you set the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
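A minimal sketch of that idea, assuming Spark 2.x PySpark on the client machine (on 1.6 the same approach works through SparkConf/SparkContext); the conf path below is a hypothetical example:

```python
import os
from pyspark.sql import SparkSession

# Point PySpark at a local copy of the cluster's conf directory (hypothetical
# path) so spark.master comes from spark-defaults.conf instead of the code.
os.environ["SPARK_CONF_DIR"] = "/opt/cluster-conf"

spark = SparkSession.builder.appName("conf-dir-check").getOrCreate()
print(spark.sparkContext.master)  # prints the master URL taken from spark-defaults.conf
```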
						
					
04-03-2017 09:35 PM
1 Kudo

@tuxnet Sure, you can use any IDE with PySpark. Here are short instructions for Eclipse and PyDev:

- set the HADOOP_HOME variable referencing the location of winutils.exe
- set the SPARK_HOME variable referencing your local Spark folder
- set SPARK_CONF_DIR to the folder where you have the actual cluster config copied (spark-defaults and log4j)
- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter

For testing purposes I'm adding code like spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate(), but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate(). Alternatively, you can set up your run configurations to use spark-submit directly. Hope it helps.
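A minimal sketch of the same wiring done from the script itself instead of the PyDev interpreter settings; all paths are hypothetical examples and Spark 2.x is assumed:

```python
import glob
import os
import sys

# Hypothetical locations; adjust for your machine. On Windows, HADOOP_HOME
# must contain bin/winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["SPARK_HOME"] = r"C:\spark"
os.environ["SPARK_CONF_DIR"] = r"C:\cluster-conf"  # copied spark-defaults and log4j

# Equivalent of adding pyspark.zip and the py4j zip to the interpreter's PYTHONPATH.
lib_dir = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
sys.path.insert(0, os.path.join(lib_dir, "pyspark.zip"))
sys.path.insert(0, glob.glob(os.path.join(lib_dir, "py4j-*.zip"))[0])

from pyspark.sql import SparkSession

# With spark.master set in SPARK_CONF_DIR/spark-defaults.conf, no explicit
# .master(...) call is needed here.
spark = SparkSession.builder.appName("ide-smoke-test").getOrCreate()
print(spark.range(10).count())
```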
						
					
02-15-2017 03:52 PM

@Cord thomas Turn on debug logging and check the log file first.
						
					
12-29-2016 03:51 PM

@vamsi valiveti It could be an option, right. But for production usage I'd additionally think about how to stop the agents and how to monitor them. From my experience, an init.d service script plus Ganglia monitoring is the best option. It lets you start/stop agents easily with commands like /etc/init.d/flume "agent" start/stop, and Ganglia provides a nice web interface for monitoring.
						
					
12-29-2016 02:40 PM

@vamsi valiveti The easiest way is to detach the shell from the command using nohup: nohup <my_command> &. Another option is to create a Flume init.d service script (I've posted an example script here; search for "Setup flume agent auto startup" on the page) and run Flume as a service. A third option is to use Ambari to control the agents.
						
					
12-29-2016 02:22 PM
1 Kudo

@vamsi valiveti You can trigger Flume from an Oozie shell action. However, pay attention that the action will be executed on a random cluster node, so all your nodes should have Flume installed. You will also need to somehow control the agents afterwards, and if you have more than 10 nodes that becomes a problem. That's why it is not a common scenario of Flume usage. I'd say the good approach is to keep Flume running all the time and schedule Oozie jobs to process the data whenever you need.
						
					
07-26-2016 02:58 PM

The default transactionCapacity for the file channel is 10,000; for the memory channel it is 100. That's why it works for you. Add the transactionCapacity property to your file channel, or increase the memory available to the Flume process (e.g. -Xmx1024m).
						
					
06-27-2016 11:18 AM

Grant write permissions to the /var/log/flume directory. You can also specify an alternative log file for a specific agent: -Dflume.log.file=my_path/my_file.log
						
					
06-21-2016 10:08 AM

High availability in Flume is just a matter of agent configuration, regardless of whether you're using Ambari or not. Here are a few links you can check:

https://flume.apache.org/FlumeUserGuide.html#flow-reliability-in-flume
https://flume.apache.org/FlumeUserGuide.html#failover-sink-processor
						
					
06-20-2016 04:15 PM
1 Kudo

I'd say whenever you need some Spark-specific features like ML, GraphX, or Streaming, use Spark as the ETL engine, since it provides an all-in-one solution for most use cases. If you have no such requirements, use Hive on Tez. If you have no Tez, use Hive on MR. In any case, Hive acts just like a metastore.
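A minimal PySpark sketch of that last point, with Spark doing the ETL while Hive only serves as the metastore; Spark 2.x built with Hive support is assumed, and the database/table names are hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark read table definitions from the Hive metastore.
spark = (SparkSession.builder
         .appName("spark-etl-over-hive-metastore")
         .enableHiveSupport()
         .getOrCreate())

# Spark performs the transformation; Hive only supplies the table metadata.
df = spark.table("my_db.events")            # hypothetical source table
(df.groupBy("event_type").count()
   .write.mode("overwrite")
   .saveAsTable("my_db.event_counts"))      # hypothetical target table
```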
						
					