Member since 
    
	
		
		
		07-29-2015
	
	
	
	
	
	
	
	
	
	
	
	
	
	
			
      
                535
            
            
                Posts
            
        
                141
            
            
                Kudos Received
            
        
                103
            
            
                Solutions
            
        My Accepted Solutions
| Title | Views | Posted | 
|---|---|---|
| 7711 | 12-18-2020 01:46 PM | |
| 5038 | 12-16-2020 12:11 PM | |
| 3840 | 12-07-2020 01:47 PM | |
| 2503 | 12-07-2020 09:21 AM | |
| 1633 | 10-14-2020 11:15 AM | 
			
    
	
		
		
		04-28-2016
	
		
		08:48 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 It looks like it's some kind of benchmark data? Is it a publicly available benchmark. It seems strange that they would have such skewed data.     I think even if it did stay in memory it would run for an incredibly long time: if 2 billion keys in one table get matches to 2 billion keys in another table then you will get 10^18 output rows (over a quintillion). So I think either there is something strange with the benchmark data or it doesn't make sense to join the tables.     We don't have the exact join algorithm documented aside from in the Impala code. It's a version of the Hybrid hash join https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		04-27-2016
	
		
		04:00 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		1 Kudo
		
	
				
		
	
		
					
							 It looks like you have extreme skew in the key that you're joining on (~2 billion duplicates). The error message is:     "Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353."     In order for the hash join to join on a key all the values for that key on the right side of the join need to be able to fit in memory. Impala has several ways to avoid this problem, but it looks like you defeated them all.     First, it tries to put the smaller input on the right side of the join, but both your inputs are the same size, so that doesn't help.     Second, it will try to spill some of the rows to disk and process a subset of those.     Third, it will try to repeatedly split ("repartition") the right-side input based on the join key to try and get it small enough to fit in memory. Based on the error message, it tried to do that 8 times but still has 2 billion rows in one of the partitions, which probably means there are 2 billion rows with the same key. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		04-22-2016
	
		
		09:21 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 Are you sure it's Impala that's triggering it? I don't think Impala would use du for anything.     HDFS apparently does and Cloudera Manager might use it.     Have you tried tracing back what is running 'du'? E.g. run "ps auxf" to get a tree-view of processes. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		04-22-2016
	
		
		09:16 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 This looks like the canonical JIRA for the problem: https://issues.cloudera.org/browse/IMPALA-2184    If you can't upgrade a major version, the JIRA says the fix is also being backported to Impala 2.3.4 (CDH 5.5.4), so if you upgrade to 2.3.4 when that is released, you will get the fix. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		04-09-2016
	
		
		10:44 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		1 Kudo
		
	
				
		
	
		
					
							 If you set num_nodes=1 that will force it to run on a single node. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		04-08-2016
	
		
		05:58 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 If the query runs with multiple fragments e.g. on 5 nodes you can get one file per fragment. If you look at the query plan (i.e. explain <the query text>) it will show the degree of parallelism for the insert. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		03-03-2016
	
		
		09:46 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		1 Kudo
		
	
				
		
	
		
					
							 The Hive UDF won't help you. If you look at the Hive issue tracker https://issues.apache.org/jira/browse/HIVE-1304, row_sequence() was added as a workaround because they didn't support the row_number() analytic function at that point in time.     We support the row_number() analytic function in Impala, so there's no reason to try to use that UDF.     If you want to start froma particular number, can't you just add it like I suggested in my previous answer? 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		03-01-2016
	
		
		07:12 AM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
	
		1 Kudo
		
	
				
		
	
		
					
							 You could possibly use the ROW_NUMBER() analytic function as part of the solution.     http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_analytic_functions.html#row_number_unique_1        E.g.     select t2.last_id + 1 + row_number() over (order by table1.col1), table1.col1  from table1     inner join (select max(id) last_id from table2) t2     This wouldn't be safe if you are running multiple insert concurrently. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		02-23-2016
	
		
		08:39 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							  Hi,    There are many possible variables, including the exact version of impala, the operating system it was built on, the build flags and environment variables, and what version/build of dependencies you're using.     I think the specific thing you're probably seeing with file sizes is that in the CDH distribution the debug symbols are stripped from the binaries and included in separate impalad.debug files.    Are you running into some error when trying to run your custom build of Impala? It probably makes more sense to debug that problem rather than trying to exactly reproduce Cloudera's build. 
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
		01-27-2016
	
		
		04:09 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 You are most likely running into this bug with the aggregation: https://issues.cloudera.org/browse/IMPALA-2352     We fixed it in CDH5.5/Impala 2.3 but the change wasn't backported because it was deemed too risky for a maintenance release.    
						
					
					... View more