About mig_aguir

mig_aguir · ‎01-09-2021

Cool, thanks for the help! Just another question, if I just were to remove the directory I don't want to use for storing staging data and then restarted Yarn service, would there be any problem? The "yarn.nodemanager.local-dirs" property already includes three other directories for storing so it shouldn't be a problem, right?

mig_aguir · ‎01-08-2021

We made a mistake while configuring the storage of our worker nodes and we ended up adding a data mount point of a really small disk to the yarn.nodemanager.local-dirs property. This causes some of our very data-intensive spark jobs to fail as this disk runs out of space pretty quickly as these jobs generate a lot of staging data. Right now the value of this property is the following: /u01/hadoop/yarn/nm,/u02/hadoop/yarn/nm,/u03/hadoop/yarn/nm,/u04/hadoop/yarn/nm We just want to remove /u04/hadoop/yarn/nm from the property, all the other directories are mounted on disks with enough capacity. So, is it safe to just remove this directory from the yarn.nodemanager.local-dirs property? Or do we have to back up the data in /u04/hadoop/yarn/nm and create another data mount point using a disk with enough capacity and put that backup there?

mig_aguir · ‎06-02-2020

This indeed made it work. Thanks!

mig_aguir · ‎06-01-2020

Hi I'm using pyspark and currently I am encountering an issue that I had not seen before when trying to write a dataframe to HDFS as a table. This is an example of the code I'm running: df.write.mode('overwrite').format('parquet').saveAsTable('{}.{}'.format(target_schema, table_name)) And the stacktrace of the error I'm getting is the following: Py4JJavaError: An error occurred while calling o1042.saveAsTable. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:502) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 3.0 because task 3 (partition 3) cannot run anywhere due to node and executor blacklist. Most recent failure: Lost task 3.0 in stage 3.0 (TID 5, <host_name>.com, executor 3): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1363) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Decompression error: Unknown frame descriptor at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:147) at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168) ... 3 more The dataframes are actually oracle database tables that I read using JDBC. This has worked fine for a lot of tables, small and big. I'm not using any customSchema to change the source datatypes (which are all VARCHAR2s). The code used to run without issues in a small cluster made of vms running in a single server with somewhat enough resources, this problem started appearing when I migrated it to a cloud based cluster, where I have dedicated resources for each node (still vms though). Both of these "environments" use CDH 6.2 with the same packaging, though the cluster based one has Kerberos enabled. Any suggestions/guidance will be greatly appreciated

Online	Offline
Last Visited	‎01-19-2021 03:42 PM

Member Since	‎01-17-2020 08:35 AM
Last Visited	‎01-19-2021 03:42 PM
Posts	6

Cloudera Community

Re: CDH 6.2 Spark 2.4.0 saveAsTable error: java.i...

Re: Remove directory from yarn.nodemanager.local-d...

Remove directory from yarn.nodemanager.local-dirs ...

Re: CDH 6.2 Spark 2.4.0 saveAsTable error: java.i...

CDH 6.2 Spark 2.4.0 saveAsTable error: java.io.IO...