Access multiple Hadoop namenodes from Spark on YARN?

New Contributor

Hi everyone,

I am working with a system where Spark jobs run on YARN against one default Hadoop namenode. Recently, I added a second Hadoop namenode to the system. I now want the Spark jobs to read their input data from the default namenode and write the output to the second one. How can I configure or specify the paths for the Spark jobs? I tried putting the fully qualified HDFS paths in the code, for example:

val spark = SparkSession.builder().appName("Example").getOrCreate()
// read from the default namenode, write to the second one
val input = spark.read.parquet("hdfs://defaultnamenode:9000/sample.parquet")
input.write.parquet("hdfs://secondnamenode:9000/sample")

But it threw the exception:

17/12/18 10:49:42 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: path hdfs://secondnamenode:9000/sample already exists.;
org.apache.spark.sql.AnalysisException: path hdfs://secondnamenode:9000/sample already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:106)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:509)

And no output is written at all :(

Re: Access multiple Hadoop namenodes from Spark on YARN?

Super Collaborator

The exception says hdfs://secondnamenode:9000/sample already exists.

List the contents of that path on the other cluster, and delete the directory if it is indeed there.
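For example, from a node that can reach the second cluster (host and port taken from the paths in your post):

# check whether the output directory already exists on the second cluster
hdfs dfs -ls hdfs://secondnamenode:9000/

# remove it if it is a leftover you no longer need
hdfs dfs -rm -r hdfs://secondnamenode:9000/sample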

Otherwise, add the overwrite option to your Spark code.
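For the overwrite route, something like this should work; mode("overwrite") tells Spark to replace the target directory if it already exists:

// overwrite the output directory instead of failing when it exists
input.write.mode("overwrite").parquet("hdfs://secondnamenode:9000/sample")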

Re: Access multiple Hadoop namenodes from Spark on YARN?

New Contributor

@Jordan Moore: Well, the thing is that the folder didn't exist at first. When the Spark job ran, it created that empty folder itself and then threw the exception :-/
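A likely explanation for that symptom: when the user class fails on YARN, the application is often retried, and the second attempt finds the (empty) output directory that the first attempt created. One way to work around it is to delete the target path right before writing. A minimal sketch using the standard Hadoop FileSystem API; the URI matches the paths above, and spark and input are the values from the original snippet:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// get a FileSystem handle for the second namenode explicitly,
// since the job's default FileSystem points at the default namenode
val fs = FileSystem.get(new URI("hdfs://secondnamenode:9000"),
                        spark.sparkContext.hadoopConfiguration)
val outPath = new Path("/sample")

// remove any leftover directory (e.g. from a failed earlier attempt)
if (fs.exists(outPath)) fs.delete(outPath, true)

input.write.parquet("hdfs://secondnamenode:9000/sample")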
