CDH 6.2 Spark 2.4.0 saveAsTable error: java.io.IOException: Decompression error: Unknown frame descriptor

Explorer

Hi,

I'm using PySpark and I'm running into an issue I hadn't seen before when trying to write a DataFrame to HDFS as a table. This is an example of the code I'm running:

df.write.mode('overwrite').format('parquet').saveAsTable('{}.{}'.format(target_schema, table_name))

And the stack trace of the error I'm getting is the following:
Py4JJavaError: An error occurred while calling o1042.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:502)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
	at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
	at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
	at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Aborting TaskSet 3.0 because task 3 (partition 3)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 3.0 in stage 3.0 (TID 5, <host_name>.com, executor 3): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Decompression error: Unknown frame descriptor
	at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:147)
	at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
	... 3 more

The DataFrames are actually Oracle database tables that I read via JDBC. This has worked fine for many tables, both small and large. I'm not using a customSchema to change the source data types (which are all VARCHAR2s).
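For context, here is a minimal sketch of the kind of JDBC read described above (not the actual job; the URL, driver, credentials, and table name are hypothetical placeholders):

# Minimal sketch of the JDBC read described above; the URL, driver,
# credentials, and table name are hypothetical placeholders.
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:oracle:thin:@//db-host:1521/SERVICE_NAME') \
    .option('driver', 'oracle.jdbc.OracleDriver') \
    .option('dbtable', 'SOURCE_SCHEMA.SOURCE_TABLE') \
    .option('user', 'db_user') \
    .option('password', 'db_password') \
    .load()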

The code used to run without issues on a small cluster of VMs hosted on a single server with roughly adequate resources. The problem started appearing when I migrated it to a cloud-based cluster where each node has dedicated resources (still VMs, though). Both environments use CDH 6.2 with the same packaging, though the cloud-based cluster has Kerberos enabled.

Any suggestions or guidance would be greatly appreciated.


4 REPLIES

Expert Contributor

Hi @mig_aguir,

Thank you for posting your query.

Could you please try running the job with --conf spark.unsafe.sorter.spill.read.ahead.enabled=false

Thanks,
Satz
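For reference, this setting disables Spark's read-ahead input stream for spill reads (org.apache.spark.io.ReadAheadInputStream, which appears in the stack trace above). It is typically passed either on the spark-submit/pyspark command line or when the SparkSession is built. A minimal sketch of both approaches, assuming a hypothetical script name and application name:

# Option 1: pass the flag when launching the job from the shell
# (my_job.py is a hypothetical script name):
#
#   spark-submit --conf spark.unsafe.sorter.spill.read.ahead.enabled=false my_job.py

# Option 2: set it in code, before the SparkSession is created:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('oracle_to_hdfs')  # hypothetical application name
         .config('spark.unsafe.sorter.spill.read.ahead.enabled', 'false')
         .getOrCreate())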

Explorer

This indeed made it work. Thanks!

New Contributor

Hey, thank you for the solution. Where should I add this code? '--conf spark.unsafe.sorter.spill.read.ahead.enabled=false'

Thank you!

New Contributor

Hey, have you found out where to add this code? '--conf spark.unsafe.sorter.spill.read.ahead.enabled=false'
Thanks!!