Created 03-16-2017 06:56 PM
When running this small piece of Scala code, I get an "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://xxx.eu-west-1.compute.internal:8020/user/cloudbreak/data/testdfsio-write".
Below is the piece of code where `saveAsTextFile` is executed. The directory does not exist before running this script, so why is the FileAlreadyExistsException being raised?
// Create a Range and parallelize it over nFiles partitions.
// The idea is to have a small RDD partitioned across a given number of workers;
// each worker then generates the data to write.
val a = sc.parallelize(1 until config.nFiles + 1, config.nFiles)
val b = a.map(i => {
  // Generate an array of Byte (8 bit) with dimension fSize,
  // fill it with "0" chars, and turn it into a string so it can be saved as text.
  // TODO: this approach can still cause memory problems in the executor if the array is too big.
  val x = Array.ofDim[Byte](fSizeBV.value).map(_ => "0").mkString("")
  x
})
// Force computation on the RDD
sc.runJob(b, (iter: Iterator[_]) => {})
// Write the output file
val (junk, timeW) = profile {
  b.saveAsTextFile(config.file)
}
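For context, a common workaround sketch (not part of the original code) when the output directory can be left behind by an earlier failed or retried attempt: delete the output path before writing. This assumes the output lives on the filesystem configured in the SparkContext; `config.file` and `sc` are the names from the snippet above.

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: remove the output directory if a previous attempt already
// created it, so saveAsTextFile does not fail its output-spec check.
val outputPath = new Path(config.file)
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true) // recursive delete
}

Alternatively, the Spark setting spark.hadoop.validateOutputSpecs can be set to false to skip the existence check entirely, at the cost of losing that safety net.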
Created 03-16-2017 08:13 PM
I could run 'Runner' without errors in local mode, so the code itself is probably not the issue.
Can you paste the exception stack trace (and, if possible, the options) that causes this to surface?
Also, I am not sure why you are doing the runJob: it will essentially be a no-op in this case since the data is not cached (see the sketch after this reply).
Regards,
Mridul
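A minimal sketch of the point above, assuming the intent of the runJob call was to materialize the RDD once before timing the write; `b`, `profile`, and `config` come from the original snippet.

// Persist the RDD so the forced computation is actually reused by the write;
// without cache() the RDD is simply recomputed inside saveAsTextFile.
val cached = b.cache()
cached.count() // triggers computation and fills the cache
val (_, timeW) = profile {
  cached.saveAsTextFile(config.file)
}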
Created 03-16-2017 08:18 PM
The exception seems to happen when nFiles is larger, like 1000, not when it's 10.
spark-submit --master yarn-cluster --class com.cisco.dfsio.test.Runner hdfs:///user/$USER/mantl-apps/benchmarking-apps/spark-test-dfsio-with-dependencies.jar --file data/testdfsio-write --nFiles 1000 --fSize 200000 -m write --log data/testdfsio-write/testHdfsIO-WRITE.log
btw: not my code.
Created 03-17-2017 09:02 PM
Solved by not having too many partitions for parallelize (see the sketch below).
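A sketch of that fix, capping the partition count instead of creating one partition per file; `maxPartitions` is a hypothetical cap, not something from the original code.

// Sketch only: parallelize over a bounded number of partitions rather than
// config.nFiles partitions (e.g. 1000), which was overwhelming the job.
val maxPartitions = 200 // hypothetical cap, tune to the cluster size
val a = sc.parallelize(1 to config.nFiles, math.min(config.nFiles, maxPartitions))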