
spark-shell hangs reading avro data

Hello,

 

We are trying to do some interactive analysis in the spark-shell with some compressed Avro data.  Whenever we try to load the data, the shell seems to hang after processing about 25% of the partitions.  Here's an example:

 

 

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.NullWritable

val conf = sc.hadoopConfiguration

// read the writer schema from HDFS and make it available to AvroKeyInputFormat
val inputSchema = new Schema.Parser().parse(FileSystem.get(conf).open(new Path("schema.avsc")))
conf.set("avro.schema.input.key", inputSchema.toString())

// the path below is a placeholder for the directory of compressed .avro files
val inputRecords = sc.newAPIHadoopFile("path/to/data",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable], conf)

println(s"Total Input Records: ${inputRecords.count}")

 

 

We submit this with:

 

spark-submit \
  --master yarn-client \
  --executor-memory 7G \
  --num-executors 100 \
  --executor-cores 5 \
  --driver-memory 7G \
  --files log4j.properties

 

The job seems to start just fine, and about 25% of the tasks complete, but then nothing further happens.  I can't find anything in the driver or executor logs to indicate what the problem is.  It seems to be waiting on something; I just don't know what.
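
If it would help, I can grab a thread dump of the driver while it's stuck, along these lines (assuming the shell's JVM shows up as SparkSubmit in jps; <driver-pid> is just a placeholder):

# find the JVM running the spark-shell (it is launched via SparkSubmit)
jps -lm | grep SparkSubmit

# dump the driver's threads to see what it is blocked on
jstack <driver-pid> > driver-threads.txt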

 

If I package that same Scala code up inside an object:

 

object AvroCount {
  def main(args: Array[String]) {
    // same code as above
  }
}

 

create a jar, and submit it with spark-submit, everything works fine.  Also note that I don't see this problem when working with plain text files in the spark-shell.
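
For reference, the packaged version is launched with essentially the same options, roughly like this (the jar name here is just a placeholder for the one we actually build):

spark-submit \
  --class AvroCount \
  --master yarn-client \
  --executor-memory 7G \
  --num-executors 100 \
  --executor-cores 5 \
  --driver-memory 7G \
  --files log4j.properties \
  avro-count.jar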

 

Anyone have any suggestions about what the problem might be?

 

Thanks.