
spark-shell hangs reading Avro data



We are trying to do some interactive analysis in the spark-shell with some compressed Avro data. Whenever we try to load the data, the shell seems to hang after processing about 25% of the partitions. Here's an example:



val conf = sc.hadoopConfiguration

// open() takes a Path, not a String
val inputSchema = new Schema.Parser().parse(
  FileSystem.get(conf).open(new Path("schema.avsc")).getWrappedStream())
conf.set("avro.schema.input.key", inputSchema.toString())

val inputRecords = sc.newAPIHadoopFile("schema.avsc",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable], conf)

println(s"Total Input Records: ${inputRecords.count}")



We submit this with:


spark-submit \
  --master yarn-client \
  --executor-memory 7G \
  --num-executors 100 \
  --executor-cores 5 \
  --driver-memory 7G



The job seems to start just fine, and about 25% of the tasks complete, but then nothing further happens. I can't find anything in the driver or executor logs to indicate what the problem is. It seems to be waiting on something; I just don't know what.


If I package that same Scala code up inside an object:


object AvroCount {
  def main(args: Array[String]) {
    // same code as above
  }
}
create a jar, and submit it with spark-submit, everything seems to work fine. Also note that when I'm working with plain text files in the spark-shell, I don't experience this problem.
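For completeness, here's roughly what the packaged version looks like. The Avro input-format and key classes (`AvroKeyInputFormat`, `AvroKey` from avro-mapred) are my reading of what the snippet above is using; our actual generic types may differ, and this is a sketch rather than the exact job we run:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("AvroCount"))
    val conf = sc.hadoopConfiguration

    // Parse the Avro schema from a file on HDFS and hand it to the input format
    val inputSchema = new Schema.Parser().parse(
      FileSystem.get(conf).open(new Path("schema.avsc")).getWrappedStream())
    conf.set("avro.schema.input.key", inputSchema.toString())

    // Read the records as (AvroKey[GenericRecord], NullWritable) pairs
    val inputRecords = sc.newAPIHadoopFile("schema.avsc",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable], conf)

    println(s"Total Input Records: ${inputRecords.count}")
    sc.stop()
  }
}
```

This version, run via spark-submit with the flags above, completes without hanging.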


Anyone have any suggestions about what the problem might be?