New Contributor
Posts: 4
Registered: ‎09-11-2014

Spark on Yarn - Avro schema fails in clustered mode on YARN

I am getting a weird Avro schema error when running in clustered mode. I don't think the error is actually related to the Avro schema, because the job works perfectly fine when running as "local".

(I suspect that when the job is distributed, some kind of type erasure is happening: Scala infers the type and boxes it to a Java primitive wrapper type.)
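For what it's worth, the boxing I suspect can be seen in plain Scala, without Avro involved at all (a minimal sketch; `describe` is just a throwaway helper, not part of my job):

```scala
// When a Scala Long is passed where an Object (AnyRef) is expected --
// as in GenericRecord.put(String, Object) -- it is auto-boxed.
def describe(v: AnyRef): String = v.getClass.getName

val sik = 10L               // Scala infers the primitive type Long
println(describe(sik))      // prints "java.lang.Long": boxed on the way in
```

So both the hard-coded literal and the `val` should arrive at Avro as a `java.lang.Long`, which is what makes the difference in behavior so puzzling.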

 

For example, this hard-coded value works ('record' is a GenericRecord):

record.put("SUPPLY_INSTANCE_KEY", 10L)

 

However, this:

  val sik = 10L
  record.put("SUPPLY_INSTANCE_KEY", sik)

 

breaks with:

 

Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 7.0:0 failed 4 times (most recent failure: Exception failure: org.apache.avro.UnresolvedUnionException: Not in union ["null","string"]: 10)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

The schema is:

{ "name": "SUPPLY_INSTANCE_KEY", "type": ["null","long"] }, ...
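Note the mismatch: the exception complains about the union ["null","string"], but the schema declares ["null","long"]. A diagnostic sketch I've been considering (assumptions: Avro on the classpath, and the record/field names here are cut down from my real schema just for illustration) is to print the union the code was actually built against, since if the executors report ["null","string"] they must be loading a different (stale) schema than the driver:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// A trimmed-down record schema with the same union as the real one.
val schemaJson =
  """{"type":"record","name":"Supply","fields":[
    |  {"name":"SUPPLY_INSTANCE_KEY","type":["null","long"]}
    |]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

val record: GenericRecord = new GenericData.Record(schema)
val sik = 10L
record.put("SUPPLY_INSTANCE_KEY", sik)   // boxed to java.lang.Long

// Which union is this JVM actually using for the field?
println(schema.getField("SUPPLY_INSTANCE_KEY").schema())
// Does a boxed Long validate against it? (true for ["null","long"])
println(GenericData.get().validate(schema, record))
```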

 

This only happens when running as "yarn-client"! Any idea why?