Processing large CSV with 400000 rows and 100000 columns

New Contributor

I have a CSV dataset with 400000 rows and 100000 columns. It is about 500 GB in size.

I have done the following:

df = spark.read.csv('largefile.csv', header=True, maxColumns=100000)



and saved in

In the terminal:


spark2-submit --master yarn --deploy-mode cluster

The Spark job starts and I can track it in the Spark UI. It keeps running, and after 30 minutes or so it fails.


For testing purposes I tried the above steps with a 10-column dataset, and the job completed successfully. Are there any restrictions on the number of columns Spark can process, or any configuration settings I need to change to increase it?


Expert Contributor



Please share the error message from the failed job.
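
While waiting for the error, a rough size estimate may help frame the problem. The sketch below uses only the figures from the question. It assumes that "500GB" means 500 * 10**9 bytes and that all rows are roughly the same size. The variable names are illustrative.

```python
# Back-of-the-envelope sizing from the figures in the question.
# Assumption: "500GB" means 500 * 10**9 bytes and rows are roughly equal in size.
rows = 400_000
columns = 100_000
total_bytes = 500 * 10**9

cells = rows * columns                 # total number of values in the file
bytes_per_row = total_bytes / rows     # average size of one CSV line
bytes_per_cell = total_bytes / cells   # average size of one value, delimiter included

print(f"cells: {cells:,}")
print(f"average row size: {bytes_per_row / 10**6:.2f} MB")
print(f"average cell size: {bytes_per_cell:.1f} bytes")
```

Under these assumptions each row averages about 1.25 MB, so every record the CSV reader parses is very wide. That may be worth keeping in mind when reading the error, but it is not a diagnosis on its own.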