02-21-2020 02:33 AM
Hi,
I want to process data from HDFS files using Spark Java code. While processing the files, I perform simple transformations such as replacing newlines with spaces and finding patterns with a regex. I used the wholeTextFiles method to read the data from HDFS, but it took 2 hours to process only 4 MB of files. I tried increasing the Spark executor memory to 15g with 4 executor instances, and it still took 2 hours.
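For reference, this is a simplified sketch of roughly what the job does (the input/output paths and the regex pattern below are only placeholders, not my actual ones, and it assumes the Spark 2.x Java API):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HdfsTextProcessing {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HdfsTextProcessing");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read each file as a (path, content) pair; the path is a placeholder
        JavaPairRDD<String, String> files =
                sc.wholeTextFiles("hdfs:///data/input");

        // Example pattern only, not the real one
        Pattern pattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

        // Replace newlines with spaces, then collect regex matches per file
        JavaRDD<String> matches = files.flatMap(pair -> {
            String flattened = pair._2().replaceAll("\\r?\\n", " ");
            Matcher m = pattern.matcher(flattened);
            List<String> found = new ArrayList<>();
            while (m.find()) {
                found.add(m.group());
            }
            return found.iterator();
        });

        matches.saveAsTextFile("hdfs:///data/output");
        sc.stop();
    }
}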
I have 1 master with 56 GiB memory and 8 cores, and 3 workers with 28 GiB memory and 8 cores each.
How can I improve the performance of the Spark job with the above node configuration?
Thanks,
Labels: Apache Spark, HDFS