
FileAlreadyExistsException when calling saveAsTextFile


When running this small piece of Scala code I get a "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://xxx.eu-west-1.compute.internal:8020/user/cloudbreak/data/testdfsio-write".

Below is the piece of code where `saveAsTextFile` is executed. The directory does not exist before running this script, so why is this FileAlreadyExistsException being raised?

    // Create a Range and parallelize it on nFiles partitions.
    // The idea is to have a small RDD partitioned across a given number of workers;
    // each worker will then generate the data to write.
    val a = sc.parallelize(1 until config.nFiles + 1, config.nFiles)

    val b = a.map(i => {
      // Generate an array of Byte (8 bit) with dimension fSize,
      // fill it with "0" chars, and turn it into a string so it can be saved as text.
      // TODO: this approach can still cause memory problems in the executor if the array is too big.
      val x = Array.ofDim[Byte](fSizeBV.value).map(x => "0").mkString("")
      x
    })

    // Force computation on the RDD
    sc.runJob(b, (iter: Iterator[_]) => {})

    // Write the output file
    val (junk, timeW) = profile {
      b.saveAsTextFile(config.file)
    }
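
For completeness: I know one common workaround (just a sketch, using the Hadoop FileSystem API with the same `sc`, `b`, and `config.file` as above) is to delete any leftover output directory before writing, but I would like to understand why the directory appears to exist in the first place.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Workaround sketch: remove a leftover output directory before writing.
    // Uses the same `sc`, `b`, and `config.file` as in the snippet above.
    val outputPath = new Path(config.file)
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(outputPath)) {
      fs.delete(outputPath, true) // recursive delete
    }
    b.saveAsTextFile(config.file)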
1 ACCEPTED SOLUTION


Re: FileAlreadyExistsException when calling saveAsTextFile


I could run 'Runner' without errors in local mode, so the code itself is probably not the issue.

Can you paste the exception stack (and possibly the options) which causes this to surface?

Also, I am not sure why you are doing the runJob; it will essentially be a no-op in this case since the data is not cached.
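
If you want the data materialized once and then reused by the write, something like this should do it (just a sketch, using the same names from your snippet):

    // Cache the RDD so that forcing the computation keeps the generated
    // partitions around, instead of regenerating them inside saveAsTextFile.
    val cached = b.cache()

    // Force materialization of all partitions.
    sc.runJob(cached, (iter: Iterator[_]) => {})

    // The save now reuses the cached partitions rather than recomputing them.
    cached.saveAsTextFile(config.file)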

Regards,

Mridul


3 REPLIES


Re: FileAlreadyExistsException when calling saveAsTextFile

The exception seems to happen when nFiles is larger, like 1000, not when it's 10.

spark-submit --master yarn-cluster --class com.cisco.dfsio.test.Runner hdfs:///user/$USER/mantl-apps/benchmarking-apps/spark-test-dfsio-with-dependencies.jar --file data/testdfsio-write --nFiles 1000 --fSize 200000 -m write --log data/testdfsio-write/testHdfsIO-WRITE.log

btw: not my code.


Re: FileAlreadyExistsException when calling saveAsTextFile

Solved by not having too many partitions for parallelize.
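
For reference, a sketch of the change (the exact cap below is only an illustration; `config.nFiles` and `sc` are the same names as in the original snippet):

    // Cap the partition count instead of creating one partition per nFiles value,
    // so parallelize does not spawn thousands of tiny tasks and output files.
    val numPartitions = math.min(config.nFiles, sc.defaultParallelism * 4)
    val a = sc.parallelize(1 until config.nFiles + 1, numPartitions)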
