CDH 6.2 Spark 2.4.0 saveAsTable error: java.io.IOException: Decompression error: Unknown frame descriptor

Explorer

Hi,

I'm using PySpark, and I'm running into an issue I hadn't seen before when writing a DataFrame to HDFS as a table. Here is an example of the code I'm running:

df.write.mode('overwrite').format('parquet').saveAsTable('{}.{}'.format(target_schema, table_name))

And this is the stack trace of the error I'm getting:

Py4JJavaError: An error occurred while calling o1042.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:502)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
	at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
	at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
	at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Aborting TaskSet 3.0 because task 3 (partition 3)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 3.0 in stage 3.0 (TID 5, <host_name>.com, executor 3): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Decompression error: Unknown frame descriptor
	at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:147)
	at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
	... 3 more

The DataFrames are actually Oracle database tables that I read over JDBC. This has worked fine for many tables, small and large. I'm not using any customSchema to change the source data types (which are all VARCHAR2s).
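For reference, the JDBC read is along these lines (just a rough sketch, run from a session where spark already exists; the URL, credentials, table name, and driver class below are placeholders rather than the exact values I use):

# Rough sketch of the JDBC read; connection details are placeholders
df = (spark.read.format('jdbc')
      .option('url', 'jdbc:oracle:thin:@//<db_host>:1521/<service_name>')
      .option('dbtable', source_table_name)
      .option('user', jdbc_user)
      .option('password', jdbc_password)
      .option('driver', 'oracle.jdbc.OracleDriver')
      .load())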

 

The code used to run without issues on a small cluster of VMs hosted on a single server with roughly adequate resources. The problem started appearing when I migrated to a cloud-based cluster where each node has dedicated resources (still VMs, though). Both "environments" run CDH 6.2 with the same packaging, although the cloud-based cluster has Kerberos enabled.

Any suggestions or guidance would be greatly appreciated.


4 REPLIES

Expert Contributor

Hi @mig_aguir,

Thank you for posting your query.

Could you please try running the job with --conf spark.unsafe.sorter.spill.read.ahead.enabled=false
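For example, the option can be passed on the spark-submit command line, or set when the SparkSession is built; here is a rough sketch (the script name and app name are placeholders):

# Option 1: on the spark-submit command line (script name is a placeholder)
#   spark-submit --conf spark.unsafe.sorter.spill.read.ahead.enabled=false my_job.py

# Option 2: when building the SparkSession in PySpark
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('oracle_to_hdfs')  # placeholder app name
         .config('spark.unsafe.sorter.spill.read.ahead.enabled', 'false')
         .enableHiveSupport()  # needed for saveAsTable against the Hive metastore
         .getOrCreate())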

Thanks,
Satz

Explorer

This indeed made it work. Thanks!

New Contributor

Hey, thank you for the solution. Where should I add this option: '--conf spark.unsafe.sorter.spill.read.ahead.enabled=false'?

Thank you!

 

New Contributor

Hey, have you found out where to add this option: '--conf spark.unsafe.sorter.spill.read.ahead.enabled=false'?
Thanks!!
