Member since 09-15-2015

Posts: 116
Kudos Received: 141
Solutions: 40

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2232 | 02-05-2018 04:53 PM |
| | 3086 | 10-16-2017 09:46 AM |
| | 2480 | 07-04-2017 05:52 PM |
| | 3879 | 04-17-2017 06:44 PM |
| | 3096 | 12-30-2016 11:32 AM |

05-17-2016 05:29 PM
1 Kudo

You can do this with YARN ACLs on the relevant YARN queues. A better way of implementing it is to use Ranger to set YARN access controls on a given set of queues. You would then require Spark users to specify a queue when they submit a job, and the access controls would be enforced. To make this effective, also prevent user access to the default queue.
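As a minimal sketch of the submit-side piece (the queue name "analytics" is only an example), the queue can be supplied either on the command line via --queue or through the spark.yarn.queue property:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: pin a Spark-on-YARN job to a specific queue. The queue name
// "analytics" is a placeholder; the YARN/Ranger ACLs on that queue then
// decide whether the submitting user may run there.
object QueuedSubmitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("queued-submit-sketch")
      .set("spark.yarn.queue", "analytics") // equivalent to --queue analytics on spark-submit
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}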
						
					
05-04-2016 11:58 PM
2 Kudos

The NiFi processor does in fact use the Elasticsearch BulkRequest Java API, so even if you set the batch size to 1 you will be using batch loading from the ES perspective. If you want to send in a file with multiple JSON records, you have two choices: use InvokeHttp to post to the REST API, or, for a more NiFi-centric solution, use SplitText to divide up the JSON records and then process them with a reasonable Batch Size on the PutElasticsearch processor. The latter gives you good control, especially if you need to use back-pressure to even out the flow and protect the ES cluster.
						
					
04-18-2016 04:32 PM
1 Kudo

							 Oregon is US_WEST_2 (per http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region) 
						
					
04-15-2016 05:17 PM
3 Kudos

While it is quite convenient from an API perspective, Spark is a very heavy solution for inferring schema from individual CSV and JSON files, unless they are very large. A better approach is to use NiFi to infer the schema. The latest version of HDF includes the InferAvroSchema processor, which will take CSV or JSON files and attach an Avro schema to the flow file as an attribute. You can then use this with the Convert processors to get schema-based data into a database, or onto HDFS, for example.
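For context, this is roughly what the Spark-side inference being compared against looks like; a minimal sketch assuming a Spark 1.x SQLContext and a placeholder path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: Spark infers the schema by scanning the file, which is why it
// is a heavyweight choice for lots of small CSV/JSON files.
object SchemaInferenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-inference-sketch"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("/path/to/records.json") // placeholder path
    df.printSchema()
    sc.stop()
  }
}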
						
					
04-08-2016 03:41 PM
1 Kudo

To configure log levels, add:

--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"

This assumes you have a file called log4j-spark.properties on the classpath (usually in the resources directory of the project you're using to build the jar). That log4j configuration then controls the verbosity of Spark's logging. I usually use something derived from the Spark default, with some customisation, like:

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.sql=WARN
# Logging for this application
log4j.logger.com.myproject=INFO
Something else to note here is that in yarn-cluster mode, all your important logs (especially the executor logs) are aggregated by YARN when the application finishes. You can retrieve them with

yarn logs -applicationId <application>

This will show you all the logs, filtered according to the levels in your configuration.
						
					
04-04-2016 12:54 AM
2 Kudos

Exactly as you would in a Hive CTAS:

%spark
sc.parallelize(1 to 10).toDF.registerTempTable("pings_raw")

%sql
create table pings
  location '/path/to/pings'
as select * from pings_raw
						
					
03-31-2016 11:59 PM
1 Kudo

The --jars option is comma-delimited, not colon-delimited. Hence the error: it says it can't find a single file with a very long name. Change your ':' to ',' and you should get further.
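As an aside, the same comma-separated convention applies if you set the equivalent property in code; a minimal sketch with placeholder jar paths:

import org.apache.spark.SparkConf

// Sketch only: spark.jars takes a comma-separated list, mirroring the --jars
// flag on spark-submit (the paths below are placeholders).
val conf = new SparkConf()
  .setAppName("jars-sketch")
  .set("spark.jars", "/path/to/dep-one.jar,/path/to/dep-two.jar")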
						
					
03-30-2016 09:30 AM
1 Kudo

GraphX works by loading an entire graph into a combination of VertexRDDs and EdgeRDDs, so the underlying database's capabilities are not really relevant to the graph computation: GraphX won't touch the database beyond the initial load. On that basis you can use anything that will effectively store and scan a list of paired tuples, plus a list of ids and other properties. From this perspective HBase or Accumulo would seem like a good bet to attach Spark to, but of course any file in HDFS would do. For the ability to modify a graph prior to analysing it in GraphX, it's more useful to pick a 'proper' graph database. Here it's worth looking at something like Accumulo Graph, which provides a graph database hosted on Accumulo, or possibly another very exciting new project, Gaffer (https://github.com/GovernmentCommunicationsHeadquarters/Gaffer), also hosted on Accumulo. Tinkerpop (https://tinkerpop.apache.org/) provides some other options focussed around the common Gremlin APIs, which are generally well understood in the graph world. Other options might include something like TitanDB hosted on HBase. These will provide you with an interface API for modifying graphs effectively.
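As a minimal sketch of the "load everything, then compute" pattern described above (the HDFS path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Sketch only: GraphLoader builds the VertexRDD / EdgeRDD pair directly from
// a plain edge-list file, so the original storage system is irrelevant to
// GraphX once this load has happened.
object GraphLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graph-load-sketch"))
    // Each line of the file is "srcId dstId" (placeholder path).
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")
    sc.stop()
  }
}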
						
					
03-29-2016 02:39 PM
1 Kudo

Ultimately in Spark, if you want to write to different tables you will probably end up writing different DataFrames / RDDs produced by filters. That way you can have a generic HBase function that is parameterised by column. Alternatively, you could write to multiple HBase tables based on a column within a mapPartitions call, but you may have trouble achieving good batch sizes for each HBase write. If your use case is simply routing a stream of records from another source to parameterised locations in HBase, NiFi may actually be a better fit; but if you are doing significant or complex batch processing before landing into HBase, the Spark approach will help.
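A rough sketch of the filter-per-table approach (writeToHBase below is a hypothetical placeholder for whatever HBase client or connector you actually use):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: stands in for your real HBase put / bulk-load code.
def writeToHBase(df: DataFrame, table: String): Unit =
  println(s"writing ${df.count()} rows to $table")

// Split one DataFrame by a routing column and write each subset to the
// table named in that column.
def routeByColumn(df: DataFrame, routingCol: String): Unit = {
  val tables = df.select(routingCol).distinct().collect().map(_.getString(0))
  tables.foreach { table =>
    writeToHBase(df.filter(df(routingCol) === table), table)
  }
}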
						
					
03-24-2016 02:06 PM
1 Kudo

							 
RDDs do not really have fields per se, unless, for example, you have an RDD of Row objects. You would usually filter on an index:

rdd.filter(x => x(1) == "thing")

(example in Scala for clarity; the same applies in Java)

If you have an RDD of a typed object, the same thing applies, but you can use a getter, for example, in the lambda / filter function.

If you want field references you would do better with the DataFrames API. Compare, for example:

sqlContext.createDataFrame(rdd, schema).where($"fieldname" === "thing")

(see the official Spark DataFrames docs for more on the schema; note the column comparison uses === rather than ==)
						
					