Member since 10-19-2014

58 Posts
6 Kudos Received
2 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6391 | 03-20-2016 10:41 PM |
| | 11689 | 04-26-2015 02:30 AM |

08-14-2016 06:07 AM

Thanks Harsh for confirming there is no external schema file concept in Parquet and for sharing the link for CREATE TABLE ... LIKE PARQUET ... syntax.

This seems to be specific to Impala, however. Is there a generic approach to use across a stack of tools including Spark, Pig, Hive as well as Impala (and with Spark and Pig not using HCatalog)?

Many thanks,
Martin
						
					
08-14-2016 03:00 AM

Hi,

Is there a concept of an external schema file in Parquet, in a similar way to Avro with avsc schema files which can be referenced in CREATE TABLE statements?

Thanks,
Martin
						
					
- Labels:
  - HDFS

07-27-2016 01:23 AM

Hi sairamvj, I would suggest you open a new thread for your question, as it is not related to the topic of this thread.

Martin
						
					
05-15-2016 09:04 PM

Hello,

What is the right way to pass the -no_multiquery option to Pig from an Oozie workflow developed in Hue?

Thanks,
Martin
						
					
- Labels:
  - Apache Oozie
  - Apache Pig
  - Cloudera Hue

04-06-2016 08:59 PM
1 Kudo

Hello,

Here's our scenario:
- Data stored in HDFS as Avro
- Data is partitioned and there are approx. 120 partitions
- Each partition has around 3,200 files in it
- The file sizes vary, as small as 2 kB and up to 50 MB
- In total there is roughly 3 TB of data
(we are well aware that such data layout is not ideal)

Requirement:
- Run a query against this data to find a small set of records, maybe around 100 rows matching some criteria

Code:

    import sys
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    if __name__ == "__main__":
        sc = SparkContext()
        sqlContext = SQLContext(sc)
        # Load all Avro files under the partitioned data directory
        df_input = sqlContext.read.format("com.databricks.spark.avro").load("hdfs://nameservice1/path/to/our/data")
        # Keep only the rows matching the filter values
        df_filtered = df_input.where("someattribute in ('filtervalue1', 'filtervalue2')")
        cnt = df_filtered.count()
        print("Record count: %i" % cnt)

Submit the code:

    spark-submit --master yarn --num-executors 50 --executor-memory 2G --driver-memory 50G --driver-cores 10 filter_large_data.py

Issue:
- This runs for many hours without producing any meaningful output. Eventually it crashes with either a GC error or a disk out of space error, or we are forced to kill it.
- We've played with different values for the --driver-memory setting, up to 200 GB. This resulted in the program running for over six hours, at which point we killed it.
- The corresponding query in Hive or Pig would take around 1.5 - 2 hours

Question:
Where are we going wrong? 🙂

Many thanks in advance,
Martin
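
For a sense of scale, the figures quoted in the scenario above can be turned into a rough file count and average file size. This is only a back-of-the-envelope sketch: it assumes every partition holds the same number of files and that "3 TB" means 3 × 10^12 bytes, neither of which is stated in the post, and the function name is made up for illustration.

```python
def layout_summary(partitions, files_per_partition, total_bytes):
    """Return (total file count, average file size in bytes)."""
    total_files = partitions * files_per_partition
    return total_files, total_bytes / total_files

# Figures from the post: ~120 partitions, ~3,200 files each, ~3 TB in total.
# Assumption: 1 TB = 10**12 bytes.
total_files, avg_bytes = layout_summary(120, 3200, 3 * 10**12)
print("Total files: %d" % total_files)
print("Average file size: %.1f MB" % (avg_bytes / 10**6))
```

Under those assumptions the layout comes to several hundred thousand files averaging only a few megabytes each, which is consistent with the post's own remark that the layout is not ideal.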
						
					
- Labels:
  - Apache Hive
  - Apache Pig
  - Apache Spark
  - Apache YARN
  - HDFS

03-20-2016 10:41 PM

Answering my own question, found this: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_avro.html

dzimka, hope this works for you too.
						
					
11-16-2015 11:09 PM

Hello,

Does anyone have any concrete examples of how to use the HDFS file concatenation functionality introduced in HDFS-222?

Thanks in advance,
Martin
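
One documented route to this functionality that does not require writing Java is the WebHDFS REST operation `op=CONCAT`, which takes the target file in the URL path and a comma-separated `sources` parameter, and is sent as an HTTP POST. The sketch below only builds such a request URL. It assumes WebHDFS is enabled on the cluster; the host name, port and file paths are hypothetical placeholders, and the helper function name is made up for illustration.

```python
from urllib.parse import quote

def webhdfs_concat_url(host, port, target, sources):
    """Build the URL for a WebHDFS CONCAT request (to be sent with HTTP POST)."""
    src = ",".join(sources)  # WebHDFS expects a comma-separated list of paths
    return "http://%s:%d/webhdfs/v1%s?op=CONCAT&sources=%s" % (
        host, port, quote(target), quote(src, safe="/,"))

# Hypothetical NameNode address and file paths, for illustration only.
url = webhdfs_concat_url("namenode.example.com", 50070,
                         "/data/part-00000",
                         ["/data/part-00001", "/data/part-00002"])
print(url)
```

The resulting URL could then be submitted with, for example, `curl -i -X POST "<url>"`. The Java `FileSystem` API exposes the same operation as `concat(Path trg, Path[] psrcs)`. HDFS restricts which files may be concatenated, and those restrictions differ between versions, so it is worth checking the documentation for the release in use before relying on this.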
						
					
- Labels:
  - HDFS

05-14-2015 04:06 AM

Thanks for the response, but unfortunately I am none the wiser 😐

Specifically, I would want to run a shell action as another user. What we observe is that shell actions are not run as the user who logged in to Hue; rather, they run under user "yarn".

Is there any way to get shell actions to run as another user?

Thanks,
Martin
						
					