Created 03-16-2017 06:56 PM
When running this small piece of Scala code I get an "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://xxx.eu-west-1.compute.internal:8020/user/cloudbreak/data/testdfsio-write".
Below is the piece of code where `saveAsTextFile` is executed. The directory does not exist before running this script, so why is this FileAlreadyExistsException being raised?
// Create a Range and parallelize it, on nFiles partitions.
// The idea is to have a small RDD partitioned on a given number of workers;
// then each worker will generate data to write.
val a = sc.parallelize(1 until config.nFiles + 1, config.nFiles)
val b = a.map(i => {
  // Generate an array of Byte (8 bit) with dimension fSize,
  // fill it up with "0" chars, and make it a string for it to be saved as text.
  // TODO: this approach can still cause memory problems in the executor if the array is too big.
  val x = Array.ofDim[Byte](fSizeBV.value).map(x => "0").mkString("")
  x
})

// Force computation on the RDD
sc.runJob(b, (iter: Iterator[_]) => {})

// Write output file
val (junk, timeW) = profile {
  b.saveAsTextFile(config.file)
}
Created 03-16-2017 08:13 PM
I could run 'Runner' without errors in local mode, so the code itself is probably not the issue.
Can you paste the exception stack trace (and possibly the options) that causes this to surface?
Also, I'm not sure why you are doing the runJob - it will essentially be a no-op in this case since the data is not cached.
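A minimal sketch of what I mean (an assumption about the intent, not code from your job): if you want b materialized before timing the write, cache it first; otherwise the map is simply recomputed by saveAsTextFile.

// Sketch: persist the generated RDD so the forced computation is actually kept
// in executor memory instead of being recomputed during saveAsTextFile.
b.cache()
sc.runJob(b, (iter: Iterator[_]) => {})  // materializes and caches b's partitions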
Regards,
Mridul
Created 03-16-2017 08:18 PM
The exception seems to happen when nFiles is larger, like 1000, not when it's 10.
spark-submit --master yarn-cluster --class com.cisco.dfsio.test.Runner hdfs:///user/$USER/mantl-apps/benchmarking-apps/spark-test-dfsio-with-dependencies.jar --file data/testdfsio-write --nFiles 1000 --fSize 200000 -m write --log data/testdfsio-write/testHdfsIO-WRITE.log
btw: not my code.
Created 03-17-2017 09:02 PM
Solved by not using too many partitions for parallelize.
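For anyone hitting the same thing, a minimal sketch of the change (the cap value is an assumption, not from the original job):

// Sketch, assuming sc and config as in the snippet above: cap the number of
// partitions instead of using one partition per generated record, so
// saveAsTextFile writes fewer, larger part-files.
val maxPartitions = 200  // hypothetical cap, tune for the cluster
val a = sc.parallelize(1 until config.nFiles + 1,
                       math.min(config.nFiles, maxPartitions))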