I'm trying to use my (quite huge cluster) as much as I can but it seems to me that no matter how much node I throw in the time to finish processing the data doesn't change much. It takes over 3 hours to complete which I'm sure can be reduced drasically..
Briefly my script does these spark steps:
data = spark_context.textFile(s3_bucket)
data_after_map = logs.map(lambda x: somestuff(x)))
flat_map = data_after_map.flatMap(lambda x: x)
result =flat_map.reduceByKey(lambda a, b: (a + b, a + b))
result = result.map(lambda a: my_parsing(a))
result = sqlContext.createDataFrame(...)
The Setup on AWS is:
1 Master: 8 vCPU, 15 GiB memory, 80 SSD GB storage
20 nodes with each node having 36 vCPU, 60 GiB memory, 32Gb of storage
When I Check on Hadoop UI it says:
Memory total: 1.02TB
Memory used: 972.25GB
vCore total: 720
vCore used: 685
on the spark UI it says:
Is there something from the above details that could lead to a counter productive settings ?
You're almost certainly bottlenecked on reading for S3, or listing the S3 directories, or both. You'd have to examine the stats in the Spark UI to really know. Given your simple job, it's almost certainly nothing to do with memory or CPU, or the shuffle. Try parallelizing more with more, smaller input files? splitting across more workers to make more bandwidth available?
I'm also not clear if you're using a Cloudera cluster here.
Yes I thought about S3 being the issue as well but didn't know how to verify this.
Could you point me to where I could check this on the Spark-UI ? Or at least what kind of syntome would show S3 being the culprit ?
IF S3 is indeed the issue what would be the best approach to this? Would transfering the entirety of the S3 data to the EMR itself make any noticeable improvements?
If you observe there are no jobs starting for an extended period of time, then I'd figure the driver is stuck listing S3 files. If you find the stage that reads the data from S3 is taking a long time, then I'd figure it's reading from S3. It's not certain that's the issue but certainly the place I'd look first.
Because the S3 support is basically from Hadoop, you might look at https://wiki.apache.org/hadoop/AmazonS3 Are you using an s3a:// URI?
I have read somehwere that 'only' s3:// was supported by AWS EMR.. didn't even try using s3a. But we never know i might gonna give it a go