FileAlreadyExistsException when calling saveAsTextFile


When running this small piece of Scala code I get an "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://xxx.eu-west-1.compute.internal:8020/user/cloudbreak/data/testdfsio-write".

Below is the piece of code where `saveAsTextFile` is executed. The directory does not exist before running this script, so why is a FileAlreadyExistsException being raised?

            // Create a Range and parallelize it, on nFiles partitions
            // The idea is to have a small RDD partitioned on a given number of workers
            // then each worker will generate data to write
            val a = sc.parallelize(1 until config.nFiles + 1, config.nFiles)

            val b = a.map(i => {
              // generate an array of Byte (8 bit), with dimension fSize
              // fill it up with "0" chars, and make it a string for it to be saved as text
              // TODO: this approach can still cause memory problems in the executor if the array is too big.
              val x = Array.ofDim[Byte](fSizeBV.value).map(x => "0").mkString("")
              x
            })

            // Force computation on the RDD
            sc.runJob(b, (iter: Iterator[_]) => {})

            // Write output file
            val (junk, timeW) = profile {
              b.saveAsTextFile(config.file)
            }
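For reference, saveAsTextFile refuses to write into a path that already exists, so a common workaround when re-running this kind of job is to remove the target first via the Hadoop FileSystem API. A minimal sketch (not part of the original job), assuming the SparkContext is available as sc and config.file is the HDFS path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Resolve the filesystem from the job's Hadoop configuration
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val out = new Path(config.file)

    // Remove a leftover output directory from a previous run, if any
    if (fs.exists(out)) {
      fs.delete(out, true) // recursive delete
    }

    b.saveAsTextFile(config.file)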
1 ACCEPTED SOLUTION


I could run 'Runner' without errors in local mode, so the code itself is probably not the issue.

Can you paste the exception stack (and possibly the options) which causes this to surface?

Also, I am not sure why you are doing the runJob - it will essentially be a no-op in this case since the data is not cached.

Regards,

Mridul
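
To illustrate the runJob point: forcing the computation only pays off if the RDD is persisted first; otherwise saveAsTextFile simply recomputes it. A minimal sketch of that variant (an assumption about the intent, not code from the original job):

    import org.apache.spark.storage.StorageLevel

    // Keep the generated partitions around so the forced computation is reused by the write
    val b = a
      .map(i => Array.ofDim[Byte](fSizeBV.value).map(_ => "0").mkString(""))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Materialize all partitions now; the subsequent write reads them from the cache
    sc.runJob(b, (iter: Iterator[_]) => {})

    b.saveAsTextFile(config.file)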


3 REPLIES


The exception seems to happen when nFiles is larger, e.g. 1000, but not when it is 10. The job is submitted as:

    spark-submit --master yarn-cluster --class com.cisco.dfsio.test.Runner hdfs:///user/$USER/mantl-apps/benchmarking-apps/spark-test-dfsio-with-dependencies.jar --file data/testdfsio-write --nFiles 1000 --fSize 200000 -m write --log data/testdfsio-write/testHdfsIO-WRITE.log

By the way: it's not my code.
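
If the goal is the total data volume rather than exactly nFiles output files, one way to keep the write-side partition count modest is to coalesce before saving. A rough sketch; the cap of 100 is an arbitrary assumption, not a value from this thread:

    // Hypothetical cap on write parallelism; 100 is an arbitrary choice for illustration
    val maxWritePartitions = 100

    b.coalesce(math.min(config.nFiles, maxWritePartitions))
      .saveAsTextFile(config.file)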


Solved by not using too many partitions for parallelize.
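
In the same spirit, a minimal sketch of bounding the partition count passed to parallelize; the cap of 200 is an assumption for illustration only:

    // Hypothetical cap; tune to the number of executor cores actually available
    val maxPartitions = 200
    val a = sc.parallelize(1 to config.nFiles, math.min(config.nFiles, maxPartitions))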