Member since: 06-02-2023
Posts: 12
Kudos Received: 0
Solutions: 0
08-10-2023
03:28 AM
Hi, In CML, sessions in the PBJ Workbench show all info messages and warnings in red by default (with a small symbol). Is there a way to set the color to black? Kind regards!
Labels:
- Cloudera Machine Learning (CML)
08-04-2023
04:21 AM
In other words: Why does this option not work: Web session timeouts (cloudera.com)
08-04-2023
02:24 AM
Correction: it is not just PBJ; the same happens for RStudio sessions. So the config setting was not applied. What could we do?
08-04-2023
12:35 AM
Hi, We recently upgraded to CML (previously CDSW). The JupyterLab-runtime sessions time out according to the value set in the configuration, but the PBJ-runtime sessions do not. Right now we assume that the PBJ sessions simply never enter an idle state. Could that be? What could prevent them from going idle? How can we ensure that user sessions are timed out after the time set in the config? Thank you in advance.
Labels:
- Cloudera Machine Learning (CML)
06-23-2023
01:23 AM
Thank you @haridjh, it worked! I am even more confused because the underlying Parquet file is already partitioned, but after inserting a repartition() the code works!
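For context, a minimal sketch of where a repartition() would slot into the snippet from the original post below; the partition count of 200 is a placeholder, and data_dir / how_many are the names used there:

# Sketch only: insert repartition() before toPandas() so the collected result
# is serialized in many smaller chunks instead of one giant Kryo buffer.
pdf = (ss.read.parquet(data_dir)
         .limit(how_many)
         .repartition(200)   # placeholder partition count
         .toPandas())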
06-20-2023
10:48 PM
Does anyone have an idea? Otherwise this is a blocker for working with larger datasets.
06-19-2023
01:50 AM
Hi, I am hitting a Kryo serializer buffer issue with the following simple line in PySpark (Spark version 2.4):

df = ss.read.parquet(data_dir).limit(how_many).toPandas()

So I am reading in a partitioned Parquet file, limiting it to 800k rows (still huge, as it has 2500 columns), and trying to convert it with toPandas(). spark.kryoserializer.buffer.max is already at the maximum possible value:

.config("spark.kryoserializer.buffer.max", "2047m")

What other ways are there to make this run (other than reducing the number of rows even further)? While stage 1 runs many tasks (approx. 157), stage 2 has only a single task and thus tries to juggle a very large object. The following is an excerpt of the error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 158, localhost, executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 78352248. To avoid this, increase spark.kryoserializer.buffer.max value.
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:331)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:461)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 78352248
at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:629)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:332)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:328)

Edit 1: After reducing how_many to 400k (from 800k as shown above), it gives exactly the same error (see below). This confuses me a bit: why is the required amount the same?

Buffer overflow. Available: 0, required: 78352248

Edit 2: What also confuses me: there are three stages in Spark for the above. The first one is clear: a very quick "parquet at NativeMethodAccessorImpl.java:0". In the third stage it crashes (and there is only one task there, so it cannot be computed across partitions). But my confusion is about the second stage: it reports an "Input Size / Records" of 14.7 GB / 1922984 and a "Shuffle Write Size / Records" of 25.4 GB / 1922984. In other words, it processed all of the rows of the file here (approx. 1.9 million rows). The limit() had no effect (?).
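As a sketch of another direction that might sidestep the single large Kryo buffer (not verified here; on Spark 2.4 the relevant setting is spark.sql.execution.arrow.enabled, and the column subset below is just a placeholder):

# Sketch only: Arrow-based toPandas() transfers the result as columnar record
# batches instead of one large Kryo-serialized object per task.
ss.conf.set("spark.sql.execution.arrow.enabled", "true")

# Optionally trim the 2500 columns down to what is actually needed before collecting.
df = ss.read.parquet(data_dir).limit(how_many)
needed_cols = df.columns[:500]           # placeholder subset of columns
pdf = df.select(*needed_cols).toPandas()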
Labels:
- Apache Spark
06-14-2023
01:06 AM
Fun fact for those interested: to have 8 cores running in this example, you need at least 64m for the -Xss options. If you choose 32m, it will not give a stack overflow error, but only 4 cores will be running 😲
06-13-2023
10:18 PM
I tried to include a cache(), but it still takes that long:

data = ss.range(rows).cache()

I also reduced the following to 32m, thinking that I may have overdone the Java stack size, but the effect is still the same:

.config("spark.driver.extraJavaOptions", "-Xss32m") \
.config("spark.executor.extraJavaOptions", "-Xss32m") \

@RangaReddy do you have an idea what I am doing wrong?

Edit: I see the following in the logs. Is the cache actually working, or does the log only show the state at the beginning?

"Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},
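A minimal sketch of how to check whether the cache is actually populated (ss and rows as in the script; the count() is only there because cache() is lazy and nothing is stored until an action runs):

# Sketch only: cache() merely marks the DataFrame for caching; an action is
# needed before anything is materialized in memory or on disk.
data = ss.range(rows).cache()
data.count()               # forces the cache to be filled

# After the action, the storage level should no longer be all "false".
print(data.storageLevel)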
06-13-2023
01:19 AM
Hi, This is a reproducible, simple issue where the performance is surprisingly bad. It is a follow-up to the case under this link, where initially a stack overflow issue occurred. The script below ran for 26 hours with over 8 cores at full load, as can be seen in the hardware statistics. Of course the object is "quite large", but similar operations on such an object do not usually take this long. The generated size was 42.5 GB across the 8 Parquet files on HDFS. Here is the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

ss = SparkSession.builder.appName("test_replication") \
    .config("spark.kryoserializer.buffer.max.mb", "2047") \
    .config('spark.sql.execution.arrow.pyspark.enabled', "true") \
    .config("spark.driver.maxResultSize", "16G") \
    .config("spark.driver.memory", "4G") \
    .config("spark.executor.memory", "16G") \
    .config("spark.dynamicAllocation.maxExecutors", "8") \
    .config("spark.executor.instances", "2") \
    .config("spark.executor.cores", "4") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.driver.extraJavaOptions", "-Xss1024m") \
    .config("spark.executor.extraJavaOptions", "-Xss1024m") \
    .config("spark.yarn.tags", "dev") \
    .getOrCreate()

rows = 2350000
cols = 2500
hdfs_dir = "/destination/on/hdfs"

data = ss.range(rows)
for i in range(cols):
    data = data.withColumn(f'col{i}', rand() * 2 - 1)

data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")

Am I doing something wrong?

Edit: I see the following element in the applicationHistory log, which surprises me. Is this normal?

"Storage Level":{"Use Disk":false,"Use Memory":false,"Deserialized":false,"Replication":1},

Edit 2: Is this because there is no cache() or persist() in place?
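Not a verified diagnosis, but a common cause of slowness with this pattern is that 2500 chained withColumn() calls build an extremely deep logical plan, which Spark then has to analyze and optimize. A sketch of the same job expressed as a single projection (names as in the script above):

# Sketch only: build all columns in one select() instead of 2500 chained
# withColumn() calls, keeping the logical plan flat and cheap to analyze
# (this also removes the need for the large -Xss stack-size settings).
from pyspark.sql.functions import rand, col

data = ss.range(rows).select(
    col("id"),
    *[(rand() * 2 - 1).alias(f"col{i}") for i in range(cols)]
)
data.write.format("parquet").mode("overwrite").save(f"{hdfs_dir}/test.parquet")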
Labels:
- Apache Spark
- Apache YARN