Support Questions

Find answers, ask questions, and share your expertise
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

Processing 250GB of data from an S3 bucket

New Contributor


I'm trying to use my (quite huge cluster) as much as I can but it seems to me that no matter how much node I throw in the time to finish processing the data doesn't change much. It takes over 3 hours to complete which I'm sure can be reduced drasically.. 

Briefly my script does these spark steps:

 = spark_context.textFile(s3_bucket)

data_after_map = x: somestuff(x)))

flat_map = data_after_map.flatMap(lambda x: x)

result =flat_map.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

result = a: my_parsing(a))

result = sqlContext.createDataFrame(...)

result.write.format("com.databricks.spark.csv") \

            .options(header=print_header, charset='utf-8').save(s3_bucket)

The Setup on AWS is:

1 Master: 8 vCPU, 15 GiB memory, 80 SSD GB storage
20 nodes with each node having 
36 vCPU, 60 GiB memory, 32Gb of storage


  "maximizeResourceAllocation": "true"

  'yarn.scheduler.capacity.resource-calculator': 'org.apache.hadoop.yarn.util.resource.DominantResourceCalculator'

  "spark.dynamicAllocation.enabled": "true",

  "spark.shuffle.service.enabled": "true"


When I Check on Hadoop UI it says:

Memory total: 1.02TB
Memory used: 972.25GB

vCore total:  720
vCore used: 685

on the spark UI it says:

Is there something from the above details that could lead to a counter productive settings ?

Thank you!


Master Collaborator

You're almost certainly bottlenecked on reading for S3, or listing the S3 directories, or both. You'd have to examine the stats in the Spark UI to really know. Given your simple job, it's almost certainly nothing to do with memory or CPU, or the shuffle. Try parallelizing more with more, smaller input files? splitting across more workers to make more bandwidth available?


I'm also not clear if you're using a Cloudera cluster here.

New Contributor

Yes I thought about S3 being the issue as well but didn't know how to verify this.

Could you point me to where I could check this on the Spark-UI ? Or at least what kind of syntome would show S3 being the culprit ? 


IF S3 is indeed the issue what would be the best approach to this? Would transfering the entirety of the S3 data to the EMR itself make any noticeable improvements?

Master Collaborator

If you observe there are no jobs starting for an extended period of time, then I'd figure the driver is stuck listing S3 files. If you find the stage that reads the data from S3 is taking a long time, then I'd figure it's reading from S3. It's not certain that's the issue but certainly the place I'd look first.


Because the S3 support is basically from Hadoop, you might look at Are you using an s3a:// URI?

New Contributor

I have read somehwere that 'only' s3:// was supported by AWS EMR.. didn't even try using s3a. But we never know i might gonna give it a go

Master Collaborator

If you're asking about EMR, this is the wrong place -- that's an Amazon product.

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.