Member since 11-16-2015

195 Posts | 36 Kudos Received | 16 Solutions
        My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 2816 | 10-23-2019 08:44 PM |
| | 2670 | 09-18-2019 09:48 AM |
| | 11571 | 09-18-2019 09:37 AM |
| | 2459 | 07-16-2019 10:58 AM |
| | 3336 | 04-05-2019 12:06 AM |

05-02-2018 03:44 AM
1 Kudo

@jirapong this is a known issue which we've recently seen in CDS 2.3. In Spark 2.3 the Snappy native loader's (SnappyNativeLoader's) parent class loader is an ExecutorClassLoader, whereas prior to Spark 2.3 it was a Launcher$ExtClassLoader. This creates an incompatibility with the Snappy version (snappy-java-1.0.4.1) packaged with CDH.

We are currently working on a fix for a future release, but there are two workarounds:

1) Use a later version of the Snappy library that works with the class-loader change described above, for example snappy-java-1.1.4. Place the new snappy-java library on a local file system (for example /var/snappy), then run your Spark application with the user classpath options as shown below:

spark2-shell --jars /var/snappy/snappy-java-1.1.4.jar --conf spark.driver.userClassPathFirst=true --conf spark.executor.extraClassPath="./snappy-java-1.1.4.jar"

2) Instead of using Snappy, change the compression codec to LZ4 or UNCOMPRESSED (which you've already tested).
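For illustration of workaround 2, here is a minimal Scala sketch, assuming a standalone application (from spark2-shell you would pass the same settings with --conf instead); the app name and output path are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch of workaround 2: steer shuffle/IO and Parquet output
// compression away from Snappy so the native Snappy loader is never used.
val spark = SparkSession.builder()
  .appName("snappy-workaround-sketch")                            // illustrative name
  .config("spark.io.compression.codec", "lz4")                    // shuffle/broadcast blocks
  .config("spark.sql.parquet.compression.codec", "uncompressed")  // Parquet output files
  .getOrCreate()

// Any Parquet write now avoids the Snappy codec entirely.
spark.range(10).toDF("id")
  .write
  .mode("overwrite")
  .parquet("/tmp/snappy_workaround_demo")  // hypothetical output path
```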

05-01-2018 11:54 PM
2 Kudos

Thanks @Benassi10 for providing the context. Much appreciated.

We are discussing this internally to see what can cause such issues. One theory is that we enabled support for Spark lineage in CDS 2.3, and if the cm-agent does not create the /var/log/spark2/lineage directory (for some reason) you can see this behaviour. If lineage is not important to you, can you try running the shell with lineage disabled?

spark2-shell --conf spark.lineage.enabled=false

If you don't want to disable lineage, another workaround is to change the lineage directory to /tmp in CM > Spark2 > Configuration > GATEWAY Lineage Log Directory > /tmp, followed by redeploying the client configuration.

Let us know if the above helps. I will update the thread once I have more information on the fix.
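If you are submitting an application rather than using the shell, the same toggle can be set on the SparkSession builder; a minimal sketch, assuming the spark.lineage.enabled property shown above (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: disable CDS lineage collection from application code
// instead of passing --conf on the spark2-shell command line.
val spark = SparkSession.builder()
  .appName("lineage-disabled-sketch")        // illustrative name
  .config("spark.lineage.enabled", "false")  // same property as the shell workaround
  .getOrCreate()
```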

05-01-2018 08:26 PM

@Swasg by any chance are you using the package name in spark-shell? Something like:

spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11-2.3.0

The error suggests that the format should be 'groupId:artifactId:version', but in your case it is 'groupId:artifactId-version'. If you are using the package on the command line or somewhere in your configuration, please change it to:

org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0

05-01-2018 07:22 PM
1 Kudo

Thanks for reporting. Could you share the full error for the missing lineage file, please? I quickly tested an upgrade from 2.2 to 2.3 but didn't hit this. A full error stack trace would certainly help.

05-01-2018 05:19 AM
4 Kudos

@rams the error is correct, as the syntax in pyspark differs from that of Scala.

For reference, here are the steps you'd need to query a Kudu table in pyspark2.

Create a Kudu table using impala-shell:

# impala-shell
CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING)
PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
insert into test_kudu values (100, 'abc');
insert into test_kudu values (101, 'def');
insert into test_kudu values (102, 'ghi');

Launch pyspark2 (here Spark version 2.1.0.cloudera3-SNAPSHOT) with the Kudu artifact and query the table:

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.

>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master', "nightly512-1.xxx.xxx.com:7051").option('kudu.table', "impala::default.test_kudu").load()
>>> kuduDF.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

For the record, the same thing can be achieved with the following commands in spark2-shell:

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.

scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._

scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051", "kudu.table" -> "impala::default.test_kudu")).kudu

scala> df.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

04-25-2018 08:49 AM

Try this: http://site.clairvoyantsoft.com/installing-sparkr-on-a-hadoop-cluster/

04-15-2018 11:17 PM

@hedy thanks for sharing.

The workaround you received makes sense when you are not using any cluster manager. Local mode (--master local[i]) is generally used when you want to test or debug something quickly: only one JVM is launched on the node from which you run pyspark, and that JVM acts as driver, executor, and master all in one. Of course, with local mode you lose the scalability and resource management that a cluster manager provides. If you want to debug why simultaneous Spark shells are not working when using Spark on YARN, we need to diagnose it from the YARN perspective (troubleshooting steps shared in the last post). Let us know.
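To make the distinction concrete, here is a minimal Scala sketch of what local mode amounts to (the app name is illustrative; this is not the workaround itself):

```scala
import org.apache.spark.sql.SparkSession

// Local mode: a single JVM on this node acts as driver, executor, and master,
// so nothing goes through YARN, its queues, or its resource limits.
val spark = SparkSession.builder()
  .appName("local-mode-sketch")   // illustrative name
  .master("local[2]")             // 2 worker threads inside one JVM
  .getOrCreate()

spark.range(5).show()             // runs entirely inside this JVM

// With Spark on YARN (e.g. `pyspark --master yarn`) the application instead has
// to be granted resources by the ResourceManager before it leaves ACCEPTED.
```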

04-11-2018 06:05 AM
2 Kudos

If the question is academic in nature then certainly, you can. If it's instead a real use case and I had to choose between Sqoop and Spark SQL, I'd stick with Sqoop. Sqoop ships with a lot of connectors that give it direct access to the databases, while Spark typically goes in via plain old JDBC and so will be substantially slower and put more load on the source database. You can also run into partition-size constraints while extracting data. So performance and manageability would certainly be key factors in deciding on a solution.

Good luck, and let us know which one you finally prefer and how your experience was. Thx
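To make the partition-size point concrete, here is a minimal Scala sketch of a parallel JDBC extract; the URL, table, credentials, and column names are hypothetical, and the partition column and bounds have to be chosen by hand, which is exactly the kind of tuning Sqoop's connectors often take care of:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-extract-sketch").getOrCreate()

// Parallel JDBC read: Spark opens numPartitions connections and splits
// partitionColumn ranges between lowerBound and upperBound.
// The JDBC driver jar must be on the classpath (e.g. via --jars).
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")  // hypothetical source DB
  .option("dbtable", "orders")                      // hypothetical table
  .option("user", "etl_user")                       // hypothetical credentials
  .option("password", "secret")
  .option("partitionColumn", "order_id")            // should be a numeric/date column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")                     // 8 concurrent DB connections
  .load()

df.write.mode("overwrite").parquet("/tmp/orders_extract")  // hypothetical target path
```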

04-10-2018 11:08 PM
1 Kudo

Sorry, this is a bug described in SPARK-22876, which reports that the current logic of spark.yarn.am.attemptFailuresValidityInterval is flawed. The JIRA is still being worked on, and looking at the comments I don't foresee a fix anytime soon.
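For context, here is a minimal Scala sketch of how this setting is normally applied (the values are just examples, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: spark.yarn.am.attemptFailuresValidityInterval is meant to count
// only the AM failures that occur within the given window, but per SPARK-22876
// the expiry/reset logic does not behave as documented, so don't rely on it yet.
val spark = SparkSession.builder()
  .appName("am-validity-interval-sketch")                        // illustrative name
  .config("spark.yarn.maxAppAttempts", "4")                      // example value
  .config("spark.yarn.am.attemptFailuresValidityInterval", "1h") // example window
  .getOrCreate()
```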

04-10-2018 09:37 PM
2 Kudos

WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

^ This generally means the problem is beyond the port mapping, i.e. it is at the queue-configuration, available-resources, or YARN level.

Assuming you are using Spark 1.6, I'd suggest temporarily changing the shell logging level to INFO and seeing if that gives a hint. The quick way to do this is to edit /etc/spark/conf/log4j.properties on the node from which you run pyspark and change the log level from WARN to INFO:

# vi /etc/spark/conf/log4j.properties
shell.log.level=INFO

$ spark-shell
....
18/04/10 20:40:50 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/10 20:40:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
18/04/10 20:40:50 INFO client.RMProxy: Connecting to ResourceManager at host-xxx.cloudera.com/10.xx.xx.xx:8032
18/04/10 20:40:52 INFO impl.YarnClientImpl: Submitted application application_1522940183682_0060
18/04/10 20:40:54 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:55 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:56 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:57 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)

Next, open the Resource Manager UI and check the state of the application (i.e. your second invocation of pyspark) to see whether it is registered but stuck in the ACCEPTED state.

If so, look at the Cluster Metrics row at the top of the RM UI page and check whether there are enough resources available.

Now kill the first pyspark session and check whether the second session changes to the RUNNING state in the RM UI. If it does, look at the queue placement rules and stats in Cloudera Manager > YARN > Resource Pools Usage (and Configuration).

Hopefully this gives us some more clues. Let us know how it goes, and feel free to share the screenshots from the RM UI and the spark-shell INFO logging.