Support Questions

JSenzier · 06-05-2018

Hi,

I have an issue with Spark: the job failed with this error message:

scala> someDF.write.mode(SaveMode.Append).parquet("file:///data/bbox/tmp")
[Stage 0:>                                                          (0 + 2) / 2]18/06/05 12:37:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, dec-bb-dl03.bbox-dec.lab.oxv.fr, executor 1): java.io.IOException: Mkdirs failed to create file:/data/bbox/tmp/_temporary/0/_temporary/attempt_201806051237_0000_m_000000_0 (exists=false, cwd=file:/yarn/nm/usercache/hdfs/appcache/application_1527756804026_0065/container_e33_1527756804026_0065_01_000002)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:926)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:907)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
        at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:286)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

18/06/05 12:37:39 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2, dec-bb-dl03.bbox-dec.lab.oxv.fr, executor 1): java.io.IOException: Mkdirs failed to create file:/data/bbox/tmp/_temporary/0/_temporary/attempt_201806051237_0000_m_000000_1 (exists=false, cwd=file:/yarn/nm/usercache/hdfs/appcache/application_1527756804026_0065/container_e33_1527756804026_0065_01_000002)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:926)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:907)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
        at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:286)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

We use CDH 5.14 with the Spark version bundled with it (1.6.0). We suspect a version incompatibility issue.

First I tried changing the directory permissions (777, or giving write access to the hadoop group), but it didn't work.

Any ideas?

Julien.

AutoIN · 06-07-2018

Hi @JSenzier

Right, this won't work in client mode. It's not about the compatibility of Spark 1.6 with the CDH version, but about the way deploy mode 'client' works. spark-shell on Cloudera installations runs in yarn-client mode by default, so the tasks run in executors on the cluster nodes and file:/// refers to each executor host's local disk, not to the machine where you started the shell. Given the use of file:/// (which is generally used for local disks), we recommend running the app in local mode for such local testing. Alternatively, you can package your script as a jar file (using Maven or sbt) and run it with spark-submit in cluster mode.

$ spark-shell --master local[*]
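
To make the point above concrete, here is a minimal sketch in plain Python (no Spark needed) of a guard that flags the combination that failed in the original post: a file:// output path together with a non-local master. The helper names is_local_master and check_output_uri are hypothetical and not part of Spark; they only encode the rule described in this answer.

```python
from urllib.parse import urlparse


def is_local_master(master: str) -> bool:
    """Return True for master strings such as 'local', 'local[4]' or 'local[*]',
    where driver and tasks share one machine."""
    return master == "local" or master.startswith("local[")


def check_output_uri(master: str, output_uri: str) -> str:
    """Hypothetical pre-flight check: return 'ok' or a short warning string."""
    scheme = urlparse(output_uri).scheme
    if scheme == "file" and not is_local_master(master):
        return ("warning: file:// paths are resolved on each executor host, "
                "not on the machine that started the shell")
    return "ok"


if __name__ == "__main__":
    # The combination from the original post (yarn-client is the CDH default).
    print(check_output_uri("yarn-client", "file:///data/bbox/tmp"))
    # The combination recommended above.
    print(check_output_uri("local[*]", "file:///data/bbox/tmp"))
```

A check like this could run before the write call; it does not change where Spark writes, it only surfaces the mismatch early.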

JSenzier · 06-07-2018

Hi,

Thank you for your help, it's working now. This issue was not easy to understand when we ran into it (I didn't understand why YARN tried to create files in the _temporary directory first), but with this explanation the behaviour makes sense. Thank you 😉

ArchenROOT · 05-31-2019

My simple ETL code:

import time

from pyspark.sql import SparkSession


def xmlConvert(spark):
    etl_time = time.time()
    # Read the XML input, one row per <HistoricalTextData> element
    df = spark.read.format('com.databricks.spark.xml').options(rowTag='HistoricalTextData').load(
        'file:///home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/data_train')
    # Pivot tag names into columns, summing values per timestamp; missing values become 0
    df = df.withColumn("TimeStamp", df["TimeStamp"].cast("timestamp")).groupBy("TimeStamp").pivot("TagName").sum(
        "TagValue").na.fill(0)
    # Write the result as a single CSV part file on the local filesystem
    df.repartition(1).write.csv(
        path="file:///proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/",
        mode="overwrite",
        header=True,
        sep=",")
    print("Time taken to do xml transformation: --- %s seconds ---" % (time.time() - etl_time))


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('XML ETL') \
        .master("local[*]") \
        .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
        .getOrCreate()

    print('Session created')

    try:
        xmlConvert(spark)

    finally:
        spark.stop()

It still throws the error reported above.
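
When "Mkdirs failed" appears even in local mode, one thing worth ruling out is whether the user running the job can create the output directory at all. The following is a minimal sketch using only the Python standard library. The helper names nearest_existing_ancestor and can_create are hypothetical, and the check is only a hint: it does not reproduce Hadoop's own logic.

```python
import os
from urllib.parse import urlparse


def nearest_existing_ancestor(path: str) -> str:
    """Walk up from path until something that exists is found."""
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create(output_uri: str) -> bool:
    """Return True if the current user has write and execute permission on
    the nearest existing ancestor directory of a file:// URI."""
    path = urlparse(output_uri).path
    ancestor = nearest_existing_ancestor(path)
    return os.path.isdir(ancestor) and os.access(ancestor, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Output path copied from the post above.
    uri = "file:///proj/prometheus-core/demo/demo-1-iot-predictive-maintainance/dataset/"
    verdict = "creatable" if can_create(uri) else "not creatable by this user"
    print(uri, "->", verdict)
```

Note that the output path in the post starts at /proj, while the input path starts at /home/zangetsu/proj. If /proj does not exist, the nearest existing ancestor is the filesystem root, which an ordinary user usually cannot write to.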

ArchenROOT · 05-31-2019

And I found a solution by pointing job.local.dir to the directory with the code:

spark = SparkSession \
    .builder \
    .appName('XML ETL') \
    .master("local[*]") \
    .config('job.local.dir', 'file:/home/zangetsu/proj/prometheus-core/demo/demo-1-iot-predictive-maintainance') \
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.5.0') \
    .getOrCreate()

Now everything works.

Cloudera Community

Spark - Cannot mkdir file