Created 06-01-2020 06:08 PM
Hi
I'm using pyspark and currently I am encountering an issue that I had not seen before when trying to write a dataframe to HDFS as a table. This is an example of the code I'm running:
df.write.mode('overwrite').format('parquet').saveAsTable('{}.{}'.format(target_schema, table_name))
And the stacktrace of the error I'm getting is the following:
Py4JJavaError: An error occurred while calling o1042.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:502)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 3.0 because task 3 (partition 3)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 3.0 in stage 3.0 (TID 5, <host_name>.com, executor 3): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Decompression error: Unknown frame descriptor
at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:147)
at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
... 3 more
The dataframes are actually oracle database tables that I read using JDBC. This has worked fine for a lot of tables, small and big. I'm not using any customSchema to change the source datatypes (which are all VARCHAR2s).
The code used to run without issues in a small cluster made of vms running in a single server with somewhat enough resources, this problem started appearing when I migrated it to a cloud based cluster, where I have dedicated resources for each node (still vms though). Both of these "environments" use CDH 6.2 with the same packaging, though the cluster based one has Kerberos enabled.
Any suggestions/guidance will be greatly appreciated
Created 06-01-2020 09:06 PM
Hie @mig_aguir ,
Thank you for posting your query
Could you please try running the job with --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
Created 06-02-2020 02:03 PM
This indeed made it work. Thanks!
Created 06-01-2020 09:06 PM
Hie @mig_aguir ,
Thank you for posting your query
Could you please try running the job with --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
Created 06-02-2020 02:03 PM
This indeed made it work. Thanks!
Created 03-01-2022 02:03 AM
hey, thank you for the solution, where I should add this code? '--conf spark.unsafe.sorter.spill.read.ahead.enabled=false'
Thank you!
Created 06-14-2022 12:00 PM
Hey, have you find out where to add this code? '--conf spark.unsafe.sorter.spill.read.ahead.enabled=false'
Thanks!!