Member since: 06-02-2023
Posts: 12
Kudos Received: 0
Solutions: 0
08-10-2023
03:28 AM
Hi, In CML, sessions in the PBJ Workbench show all info messages and warnings in red by default (with a small symbol). Is there a way to set the color to black? Kind regards!
Labels:
- Cloudera Machine Learning (CML)
08-04-2023
04:21 AM
In other words: Why does this option not work: Web session timeouts (cloudera.com)
08-04-2023
02:24 AM
Correction: it is not just PBJ; the same happens for RStudio sessions. So the config setting was not applied. What could we do?
08-04-2023
12:35 AM
Hi, We recently upgraded to CML (previously CDSW). The JupyterLab-runtime sessions time out according to the value set in the configuration, but the PBJ-runtime sessions do not. Right now we assume that the PBJ sessions simply never enter an idle state. Could that be? What could prevent them from going idle? How can we ensure that user sessions are timed out after the time set in the config? Thank you in advance.
Labels:
- Cloudera Machine Learning (CML)
06-23-2023
01:23 AM
Thank you @haridjh, it worked! I am even more confused because the underlying Parquet file is already partitioned, but after inserting a repartition() the code works!
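For context, a minimal sketch of where a repartition() would slot into the snippet from the original post below; the partition count of 200 is a placeholder, and data_dir / how_many are the names used there:

# Sketch only: insert repartition() before toPandas() so the collected result
# is serialized in many smaller chunks instead of one giant Kryo buffer.
pdf = (ss.read.parquet(data_dir)
         .limit(how_many)
         .repartition(200)   # placeholder partition count
         .toPandas())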
06-20-2023
10:48 PM
Does anyone have an idea? Otherwise this is a blocker for working with larger datasets.
06-19-2023
01:50 AM
Hi, I am hitting a Kryo serializer buffer issue with the following simple line in PySpark (Spark version 2.4):

df = ss.read.parquet(data_dir).limit(how_many).toPandas()

So I am reading in a partitioned Parquet file, limiting it to 800k rows (still huge, as it has 2500 columns), and trying to convert it with toPandas(). spark.kryoserializer.buffer.max is already at the maximum possible value:

.config("spark.kryoserializer.buffer.max", "2047m")

What other ways are there to make this run (other than reducing the number of rows even further)? While stage 1 runs many tasks (approx. 157), stage 2 has only a single task and thus tries to juggle a very large object. The following is an excerpt of the error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 158, localhost, executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 78352248. To avoid this, increase spark.kryoserializer.buffer.max value.
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:331)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:461)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 78352248
at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:629)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:332)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:328)

Edit 1: After reducing how_many to 400k (from 800k as shown above), it gives exactly the same error (see below). This confuses me a bit: why is the required amount the same?

Buffer overflow. Available: 0, required: 78352248

Edit 2: What also confuses me: there are three stages in Spark for the above. The first one is clear: a very quick "parquet at NativeMethodAccessorImpl.java:0". In the third stage it crashes (and there is only one task there, so it cannot be computed across partitions). But my confusion is about the second stage: it reports an "Input Size / Records" of 14.7 GB / 1922984 and a "Shuffle Write Size / Records" of 25.4 GB / 1922984. In other words, it processed all of the rows of the file here (approx. 1.9 million rows). The limit() had no effect (?).
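As a sketch of another direction that might sidestep the single large Kryo buffer (not verified here; on Spark 2.4 the relevant setting is spark.sql.execution.arrow.enabled, and the column subset below is just a placeholder):

# Sketch only: Arrow-based toPandas() transfers the result as columnar record
# batches instead of one large Kryo-serialized object per task.
ss.conf.set("spark.sql.execution.arrow.enabled", "true")

# Optionally trim the 2500 columns down to what is actually needed before collecting.
df = ss.read.parquet(data_dir).limit(how_many)
needed_cols = df.columns[:500]           # placeholder subset of columns
pdf = df.select(*needed_cols).toPandas()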
Labels:
- Apache Spark
06-14-2023
01:06 AM
Fun fact for those interested: to have 8 cores running in this example, you need at least 64m for the -Xss options. If you choose 32m, it will not give a stack overflow error, but only 4 cores will be running 😲
06-13-2023
10:18 PM
I tried to include a cache(), but it still takes that long:

data = ss.range(rows).cache()

I also reduced the following to 32m, thinking that I may have overdone the Java stack size, but the effect is still the same:

.config("spark.driver.extraJavaOptions", "-Xss32m") \
.config("spark.executor.extraJavaOptions", "-Xss32m") \

@RangaReddy do you have an idea what I am doing wrong?

Edit: I see the following in the logs. Is the cache actually working, or does the log only show the state at the beginning?

"Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},
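A minimal sketch of how to check whether the cache is actually populated (ss and rows as in the script; the count() is only there because cache() is lazy and nothing is stored until an action runs):

# Sketch only: cache() merely marks the DataFrame for caching; an action is
# needed before anything is materialized in memory or on disk.
data = ss.range(rows).cache()
data.count()               # forces the cache to be filled

# After the action, the storage level should no longer be all "false".
print(data.storageLevel)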
06-13-2023
01:19 AM
Hi, This is a reproducible, simple issue where the performance is surprisingly bad. It is a follow-up to the case under this link, where initially a stack overflow issue occurred. The script below ran for 26 hours with over 8 cores at full load, as can be seen in the hardware statistics. Of course the object is "quite large", but similar operations on such an object do not usually take this long. The generated size was 42.5 GB across the 8 Parquet files on HDFS. Here is the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

ss = SparkSession.builder.appName("test_replication") \
    .config("spark.kryoserializer.buffer.max.mb", "2047") \
    .config('spark.sql.execution.arrow.pyspark.enabled', "true") \
    .config("spark.driver.maxResultSize", "16G") \
    .config("spark.driver.memory", "4G") \
    .config("spark.executor.memory", "16G") \
    .config("spark.dynamicAllocation.maxExecutors", "8") \
    .config("spark.executor.instances", "2") \
    .config("spark.executor.cores", "4") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.driver.extraJavaOptions", "-Xss1024m") \
    .config("spark.executor.extraJavaOptions", "-Xss1024m") \
    .config("spark.yarn.tags", "dev") \
    .getOrCreate()

rows = 2350000
cols = 2500
hdfs_dir = "/destination/on/hdfs"

data = ss.range(rows)
for i in range(cols):
    data = data.withColumn(f'col{i}', rand() * 2 - 1)

data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")

Am I doing something wrong?

Edit: I see the following element in the applicationHistory log, which surprises me. Is this normal?

"Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},

Edit 2: Is this because there is no cache() or persist() in place?
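Not a verified diagnosis, but a common cause of slowness with this pattern is that 2500 chained withColumn() calls build an extremely deep logical plan, which Spark then has to analyze and optimize. A sketch of the same job expressed as a single projection (names as in the script above):

# Sketch only: build all columns in one select() instead of 2500 chained
# withColumn() calls, keeping the logical plan flat and cheap to analyze
# (this also removes the need for the large -Xss stack-size settings).
from pyspark.sql.functions import rand, col

data = ss.range(rows).select(
    col("id"),
    *[(rand() * 2 - 1).alias(f"col{i}") for i in range(cols)]
)
data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")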
Labels:
- Apache Spark
- Apache YARN