Posts: 58
Registered: ‎10-19-2014

How to process a large data set with Spark



Here's a our scenario:

  • Data stored in HDFS as Avro
  • Data is partitioned and there are approx. 120 partitions
  • Each partition has around 3,200 files in it
  • The file sizes vary, as small as 2 kB and up to 50 MB
  • In total there is roughly 3 TB of data
  • (we are well aware that such data layout is not ideal)


  • Run a query against this data to find a small set of records, maybe around 100 rows matching some criteria




import sys
from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":

	sc = SparkContext()
	sqlContext = SQLContext( sc )

	df_input = "com.databricks.spark.avro" ).load( "hdfs://nameservice1/path/to/our/data" )
	df_filtered = df_input.where( "someattribute in ('filtervalue1', 'filtervalue2')" )

	cnt = df_filtered.count()
	print( "Record count: %i" % cnt )

Submit the code:


spark-submit --master yarn --num-executors 50 --executor-memory 2G --driver-memory 50G --driver-cores 10


  • This runs for around many hours without producing any meaningful output. Eventually it crashes either with GC error, disk out of space error, or we are forced to kill it.
  • We've played with different values for the --driver-memory setting, up to 200 GB. This resulted in the program running for over six hours at which point we killed it.
  • Corresponding query in Hive or Pig would take around 1.5 - 2 hours


  • Where are we going wrong? :)


Many thanks in advance,


New Contributor
Posts: 1
Registered: ‎05-01-2018

Re: How to process a large data set with Spark



I am facing a similar problem, how did you resolve it?