Spark job takes 2 hours to read data from HDFS using wholeTextFiles

Hi,

I want to process data from HDFS files using Spark Java code. During processing I apply simple transformations: replacing newlines with spaces and finding patterns in the file with a regex. I read the data with the wholeTextFiles method, but the job took 2 hours to process only 4 MB of data. I tried increasing the Spark executor memory to 15g with 4 executor instances, but it still took 2 hours.

I have 1 master with 56 GiB memory and 8 cores, and 3 workers with 28 GiB memory and 8 cores each.

How can I improve the performance of the Spark job with the above node configuration?

Thanks,

1 REPLY

Re: Spark job takes 2 hours to read data from HDFS using wholeTextFiles

Hi,

I understand that you have Spark Java code that takes 2 hours to process 4 MB of data, and that you would like to improve the performance of this application.

I recommend checking the documents below, which cover performance tuning at both the code and configuration level.

https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-1/

https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
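
Before changing configuration, it may also be worth timing the transformation by itself, outside Spark. wholeTextFiles returns each file as a single (path, content) record, so one file is handled by one task, and adding executors or memory would not be expected to speed up that task. The sketch below is a minimal, hypothetical check in plain Java: the class name, the sample regex, and the synthetic input are all assumptions, since the original code and patterns were not posted. If the real regex is slow here too, the pattern itself (for example, heavy backtracking) is a more likely cause than the HDFS read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the original job: it runs the two steps
// described in the question on one in-memory string, with no Spark involved.
final class LocalTransformCheck {

    // Compile once and reuse. The pattern is a placeholder assumption;
    // substitute the regex used in the real job.
    private static final Pattern SAMPLE_PATTERN =
            Pattern.compile("\\b[A-Z]{3}-\\d+\\b");

    // Replace every line break (\r\n, \r, or \n) with a single space.
    static String flattenNewlines(String content) {
        return content.replaceAll("\\r\\n|\\r|\\n", " ");
    }

    // Collect every non-overlapping match of the pattern, in order.
    static List<String> findMatches(String content) {
        List<String> matches = new ArrayList<>();
        Matcher m = SAMPLE_PATTERN.matcher(content);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Build roughly 4 MB of synthetic text (an assumption standing in
        // for the real HDFS file contents).
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 4 * 1024 * 1024) {
            sb.append("order ABC-123 shipped\nticket XYZ-9 closed\n");
        }

        long start = System.nanoTime();
        String flat = flattenNewlines(sb.toString());
        List<String> found = findMatches(flat);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(found.size() + " matches in " + elapsedMs + " ms");
    }
}
```

To use it, swap in the real pattern and a sample of the real file. If this local run finishes quickly while the Spark job still takes hours, the time is probably being spent elsewhere, and the Spark UI stage and task timings are the next place to look.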

Thanks

Jerry