FileAlreadyExistsException when calling saveAsTextFile

When running this small piece of Scala code, I get an "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://xxx.eu-west-1.compute.internal:8020/user/cloudbreak/data/testdfsio-write".

Below is the piece of code where `saveAsTextFile` is executed. The directory does not exist before running this script. Why is this FileAlreadyExistsException being raised?

            // Create a Range and parallelize it, on nFiles partitions
            // The idea is to have a small RDD partitioned on a given number of workers
            // then each worker will generate data to write
            val a = sc.parallelize(1 until config.nFiles + 1, config.nFiles)

            val b = a.map(i => {
              // generate an array of Byte (8 bit), with dimension fSize
              // fill it up with "0" chars, and make it a string for it to be saved as text
              // TODO: this approach can still cause memory problems in the executor if the array is too big.
              val x = Array.ofDim[Byte](fSizeBV.value).map(x => "0").mkString("")
              x
            })

            // Force computation on the RDD
            sc.runJob(b, (iter: Iterator[_]) => {})

            // Write output file
            val (junk, timeW) = profile {
              b.saveAsTextFile(config.file)
            }
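
For reference, a leftover output directory from an earlier, partially failed run raises the same exception; it can be cleared up front with the Hadoop FileSystem API. A minimal, untested sketch, reusing `sc`, `b` and `config.file` from the snippet above:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Remove the output directory if a previous run left it behind,
    // so that saveAsTextFile starts from a clean path.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val outPath = new Path(config.file)
    if (fs.exists(outPath)) {
      fs.delete(outPath, true) // recursive: the directory holds the part files
    }
    b.saveAsTextFile(config.file)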
1 ACCEPTED SOLUTION

I could run 'Runner' without errors in local mode, so the code itself is probably not the issue.

Can you paste the exception stack trace (and, if possible, the options used) that causes this to surface?

Also, I'm not sure why you are doing the runJob - it will essentially be a no-op in this case, since the data is not cached.
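
If the intent of that step is to materialize `b` once before the timed write, caching it first would make the forced computation stick; a minimal sketch using the names from the snippet above:

    // Keep the generated data in memory so the forced pass is reused by the write.
    b.cache()
    // Materialize every partition; count() serves the same purpose as the runJob call.
    b.count()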

Regards,

Mridul

3 REPLIES

The exception seems to happen when nFiles is large, e.g. 1000; it does not occur when it is 10.

    spark-submit --master yarn-cluster --class com.cisco.dfsio.test.Runner \
      hdfs:///user/$USER/mantl-apps/benchmarking-apps/spark-test-dfsio-with-dependencies.jar \
      --file data/testdfsio-write --nFiles 1000 --fSize 200000 -m write \
      --log data/testdfsio-write/testHdfsIO-WRITE.log

btw: not my code.

Solved by not having too many partitions for parallelize.
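
For anyone hitting the same thing, a minimal sketch of that kind of fix, reusing the names from the question (the cap of 200 is an arbitrary illustration, not a tuned value):

    // Cap the number of partitions so the write does not launch one output
    // task (and one part file) per requested file when nFiles is very large.
    val maxPartitions = 200 // hypothetical cap; tune for the cluster
    val a = sc.parallelize(1 until config.nFiles + 1, math.min(config.nFiles, maxPartitions))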