
How to speed up wholeTextFiles processing on CDH 5.5?


I am trying to read about 8,000 files spread across 10 directories with sc.wholeTextFiles, but I keep failing :-)

 

1) When I run the code on all directories at once, I get an out-of-memory exception:

val rdd = sc.wholeTextFile("/mySource/*/*")
rdd.count()

2) When I run the code on a single directory with only 1,200 files, it works, but takes more than 12 minutes:

val rdd = sc.wholeTextFile("/mySource/dir1/*")
rdd.count()

I tried adding more partitions, but that brought back the out-of-memory exception:

val rdd = sc.wholeTextFile("/mySource/dir1/*",8) //also tired 4,32 all failed
rdd.count()
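
In case it helps, this is the minimal sanity check I can run in spark-shell to confirm how many files the glob actually matches and whether the minPartitions hint is honored (as far as I understand, wholeTextFiles packs many small files into each partition via CombineFileInputFormat, so the hint may not be respected exactly):

import org.apache.hadoop.fs.{FileSystem, Path}

// Count the files matched by the glob, without reading any data
val fs = FileSystem.get(sc.hadoopConfiguration)
val matched = fs.globStatus(new Path("/mySource/*/*"))
println(s"files matched: ${matched.length}")

// See how many partitions Spark actually created for the read
// (Spark 1.5, so rdd.partitions.length rather than getNumPartitions)
val rdd = sc.wholeTextFiles("/mySource/dir1/*", 8)
println(s"partitions created: ${rdd.partitions.length}")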

My cluster is 5 servers, each with 24 cores and 24 GB of RAM, and I am launching the shell with:

spark-shell --master yarn-client --executor-cores 5 --executor-memory 5G
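
For completeness, a fuller version of that launch line with --num-executors and --driver-memory made explicit (the values here are illustrative, not something I have tuned):

spark-shell --master yarn-client \
  --num-executors 5 \
  --executor-cores 5 \
  --executor-memory 5G \
  --driver-memory 4G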

The files are plain text files.

 

Any help is appreciated!

Eran