Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark - Cannot mkdir file

Solved Go to solution

Spark - Cannot mkdir file

Explorer

Hi,

 

I have an issue with Spark, the job failed with this error message :

 

scala> someDF.write.mode(SaveMode.Append).parquet("file:///data/bbox/tmp")
[Stage 0:>                                                          (0 + 2) / 2]18/06/05 12:37:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, dec-bb-dl03.bbox-dec.lab.oxv.fr, executor 1): java.io.IOException: Mkdirs failed to create file:/data/bbox/tmp/_temporary/0/_temporary/attempt_201806051237_0000_m_000000_0 (exists=false, cwd=file:/yarn/nm/usercache/hdfs/appcache/application_1527756804026_0065/container_e33_1527756804026_0065_01_000002)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:926)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:907)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
        at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:286)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

18/06/05 12:37:39 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2, dec-bb-dl03.bbox-dec.lab.oxv.fr, executor 1): java.io.IOException: Mkdirs failed to create file:/data/bbox/tmp/_temporary/0/_temporary/attempt_201806051237_0000_m_000000_1 (exists=false, cwd=file:/yarn/nm/usercache/hdfs/appcache/application_1527756804026_0065/container_e33_1527756804026_0065_01_000002)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:926)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:907)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
        at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:286)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

 

 

We use CDH 5.14 with the Spark included into the CDH (1.6.0), we think about an version incompatibility issue.

 

First I tried to change directory rights (777 or give write right to hadoop group), but it didn't work.

 

Any idea ?

 

Julien.

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Spark - Cannot mkdir file

Expert Contributor

Hi @JSenzier 

 

Right, this won't work in client mode. It's not about the compatibility of Spark1.6 with CDH version, but the way deploy mode 'client' works. spark-shell on Cloudera installs runs in yarn-client mode by default. Given the use of file:/// (which is generally used for local disks) we recommend running the app in local mode for such local testing or you can turn your script (using maven or sbt) into a jar file and execute this using spark-submit in cluster mode. 

 

$ spark-shell --master local[*]

4 REPLIES 4

Re: Spark - Cannot mkdir file

Expert Contributor

Hi @JSenzier 

 

Right, this won't work in client mode. It's not about the compatibility of Spark1.6 with CDH version, but the way deploy mode 'client' works. spark-shell on Cloudera installs runs in yarn-client mode by default. Given the use of file:/// (which is generally used for local disks) we recommend running the app in local mode for such local testing or you can turn your script (using maven or sbt) into a jar file and execute this using spark-submit in cluster mode. 

 

$ spark-shell --master local[*]

Re: Spark - Cannot mkdir file

Explorer

Hi,

 

Thank you for your help, it's working, it's not very easy to understand when we encountered this issue (i didn't understand why yarn tried to create files into _temporary directory first), but with this explanation we can now understand this behaviour, so thank you ;)

Re: Spark - Cannot mkdir file

Explorer

My simple ETL code:

def xmlConvert(spark):
etl_time = time.time()
df = spark.read.format('com.databricks.spark.xml').options(rowTag='HistoricalTextData').load(
'file:///home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/data_train')
df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")).groupBy("TimeStamp").pivot("TagName").sum(
"TagValue").na.fill(0)
df.repartition(1).write.csv(
path="file:///proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/",
mode="overwrite",
header=True,
sep=",")
print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
spark = SparkSession \
.builder \
.appName('XML ETL') \
.master("local[*]") \
.config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
.getOrCreate()

print('Session created')

try:
xmlConvert(spark)

finally:
spark.stop()

Still throwing the issue reported.

Highlighted

Re: Spark - Cannot mkdir file

Explorer

And I found a solution by pointint job.local.dir to directory with the code:

spark = SparkSession \
.builder \
.appName('XML ETL') \
.master("local[*]") \
.config('job.local.dir', 'file:/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
.config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
.getOrCreate()

Now all works 

Don't have an account?
Coming from Hortonworks? Activate your account here