Member since: 06-02-2023

12 Posts | 0 Kudos Received | 0 Solutions

08-10-2023 03:28 AM

Hi,

In CML, the sessions in the PBJ Workbench show all info and warning messages in red by default (with a little symbol). Is there a way to set the color to black?

Kind regards!
						
					
Labels: Cloudera Machine Learning (CML)

08-04-2023 04:21 AM

In other words: why does this option not work? (See: Web session timeouts (cloudera.com))
						
					
08-04-2023 02:24 AM

Correction: it is not just PBJ, but also RStudio sessions. So the config setting was not picked up. What could we do?
						
					
08-04-2023 12:35 AM

Hi,

We recently upgraded to CML (previously CDSW). The jupyterlab-runtime sessions time out according to the value set in the configuration - but the PBJ-runtime sessions do not. Right now we assume that the PBJ sessions simply never reach some sort of idle state. Could that be? What could prevent them from going idle? How can we make sure that user sessions time out after the period set in the config?

Thank you in advance.
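As a quick check from inside a session (a sketch, not from the thread: it assumes the timeouts are driven by the IDLE_MAXIMUM_MINUTES / SESSION_MAXIMUM_MINUTES environment variables documented for CDSW/CML, and that PBJ runtimes read the same ones), one could print what the runtime actually sees:

    # Hypothetical check: print the timeout-related environment variables the
    # session sees. The variable names come from the CDSW/CML docs; whether the
    # PBJ runtime honors them is exactly the open question here.
    import os
    for var in ("IDLE_MAXIMUM_MINUTES", "SESSION_MAXIMUM_MINUTES"):
        print(var, "=", os.environ.get(var, "<not set>"))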
						
					
Labels: Cloudera Machine Learning (CML)

06-23-2023 01:23 AM

Thank you @haridjh! It worked! I am even more confused, because the underlying parquet file is already partitioned. But when I insert a repartition(), the code works!
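For context, a minimal sketch of what inserting a repartition() could look like here, based on the read shown in the 06-19-2023 post below (the partition count 200 is an illustrative assumption, not the value actually used):

    # Sketch: same read as in the original post, with a repartition() inserted
    # before collecting to pandas so the limited rows are spread across tasks.
    df = ss.read.parquet(data_dir) \
        .limit(how_many) \
        .repartition(200) \
        .toPandas()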
						
					
06-20-2023 10:48 PM

Does anyone have an idea? Otherwise this is a blocker for working with larger datasets.
						
					
06-19-2023 01:50 AM

Hi,

I am hitting a kryoserializer buffer issue with the following simple line in PySpark (Spark version 2.4):

    df = ss.read.parquet(data_dir).limit(how_many).toPandas()

So I am reading in a partitioned parquet file, limiting it to 800k rows (still huge, as it has 2500 columns) and trying to convert it with toPandas().

spark.kryoserializer.buffer.max is already at the maximum possible value:

    .config("spark.kryoserializer.buffer.max", "2047m")

What other ways are there to make this run (apart from reducing the number of rows even further)? While stage 1 runs many tasks (approx. 157), stage 2 has only one - and thus tries to juggle a very large object.

The following is an excerpt of the error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 158, localhost, executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 78352248. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:331)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:461)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 78352248
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)
	at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:629)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:332)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:328)

Edit1: After reducing how_many to 400k (from 800k as shown above) it also gives the exact same error (see below). This confuses me a bit: why is the required amount the same?

    Buffer overflow. Available: 0, required: 78352248

Edit2: What also confuses me: there are three stages in Spark for the above. The first one is clear: a very quick "parquet at NativeMethodAccessorImpl.java:0". In the third stage it crashes (and there is only one task there, so it cannot be computed across partitions). But my confusion is about the second stage: it reports an "Input Size / Records" of 14.7 GB / 1922984 and a "Shuffle Write Size / Records" of 25.4 GB / 1922984. In other words, it processed all of the rows of the file here (approx. 1.9 million rows). The limit() function had no effect (?).
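One way to see why the final stage ends up as a single task (a sketch using the data_dir and how_many variables from the snippet above) is to check the partition count of the limited DataFrame before collecting it:

    # Sketch: limit() typically collapses the result into very few partitions
    # (often just one) before collection, so a single task ends up serializing
    # the whole object - which is where the Kryo buffer overflows.
    df = ss.read.parquet(data_dir).limit(how_many)
    print(df.rdd.getNumPartitions())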
						
					
Labels: Apache Spark

06-14-2023 01:06 AM

Fun fact for those interested: in order to have 8 cores running you need, in this example, at least 64m for the -Xss options. If you choose 32m, it will not give a stack overflow error, but only 4 cores will be running 😲
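Concretely, that would mean raising the stack-size settings used in the 06-13-2023 script below to 64m; a sketch (only the -Xss options differ, the remaining builder settings are assumed unchanged from that script):

    from pyspark.sql import SparkSession

    # Sketch: -Xss64m (instead of -Xss32m) so that, per the observation above,
    # all 8 cores actually run in this example.
    ss = SparkSession.builder.appName("test_replication") \
        .config("spark.executor.instances", "2") \
        .config("spark.executor.cores", "4") \
        .config("spark.driver.extraJavaOptions", "-Xss64m") \
        .config("spark.executor.extraJavaOptions", "-Xss64m") \
        .getOrCreate()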
						
					
06-13-2023 10:18 PM

I tried to include a cache(), but it still takes that long:

    data = ss.range(rows).cache()

I also reduced the following to 32m, thinking that maybe I overdid it with the Java stack size, but the effect is still the same:

    .config("spark.driver.extraJavaOptions", "-Xss32m") \
    .config("spark.executor.extraJavaOptions", "-Xss32m") \

@RangaReddy do you have an idea what I am doing wrong?

Edit: I see the following in the logs - is the cache actually working, or does it only show this at the beginning?

    "Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},
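A direct way to answer the caching question from within the session (a sketch; ss and rows are the objects from the script in the post below) is to inspect the DataFrame's storage level after an action has materialized it:

    # Sketch: is_cached and storageLevel are standard PySpark DataFrame
    # attributes; the cache is only filled once an action has run.
    data = ss.range(rows).cache()
    data.count()              # materialize the cache
    print(data.is_cached)     # True if cache()/persist() was requested
    print(data.storageLevel)  # shows the effective storage level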
						
					
06-13-2023 01:19 AM

Hi,

This is a reproducible, simple issue where the performance is surprisingly bad. It is a follow-up to the case under this link, where initially a stackoverflow issue occurred.

The script below ran for 26 hours on 8 cores at full load, as seen in the hardware statistics.

Of course the object is "quite large" - but similar operations on such an object do not take this long. The generated size was 42.5 GB across the 8 parquet files on HDFS.

Here is the code:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand

    ss = SparkSession.builder.appName("test_replication") \
        .config("spark.kryoserializer.buffer.max.mb", "2047") \
        .config('spark.sql.execution.arrow.pyspark.enabled', "true") \
        .config("spark.driver.maxResultSize", "16G") \
        .config("spark.driver.memory", "4G") \
        .config("spark.executor.memory", "16G") \
        .config("spark.dynamicAllocation.maxExecutors", "8") \
        .config("spark.executor.instances", "2") \
        .config("spark.executor.cores", "4") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.driver.extraJavaOptions", "-Xss1024m") \
        .config("spark.executor.extraJavaOptions", "-Xss1024m") \
        .config("spark.yarn.tags", "dev") \
        .getOrCreate()

    rows = 2350000
    cols = 2500
    hdfs_dir = "/destination/on/hdfs"

    data = ss.range(rows)
    for i in range(cols):
        data = data.withColumn(f'col{i}', rand() * 2 - 1)
    data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")

Am I doing something wrong?

Edit: I see the following element in the applicationHistory log and it surprises me - is this normal?

    "Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},

Edit2: Is this due to there being no cache() or persist() in place?
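A commonly suggested variation (not from this thread) is to build all the random columns in a single select() instead of 2500 chained withColumn() calls, which keeps the logical plan much flatter; a sketch reusing ss, rows, cols and hdfs_dir from the script above:

    from pyspark.sql.functions import rand, col

    # Sketch (swapped-in technique, not the author's code): one select() with a
    # list comprehension instead of a loop of withColumn() calls.
    data = ss.range(rows).select(
        col("id"),
        *[(rand() * 2 - 1).alias(f"col{i}") for i in range(cols)]
    )
    data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")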
						
					
Labels: Apache Spark, Apache YARN