I have a CSV dataset of 400000 x 100000. It is 500 GB in size.
I have done the following:
df = spark.read.csv('largefile.csv', header=True, maxColumns=100000)
and saved it as spark_test.py
In the terminal:
spark2-submit --master yarn --deploy-mode cluster spark_test.py
The Spark job starts and I can track it in the Spark UI. It keeps running, but fails after about 30 minutes.
For testing purposes, I tried the same steps with a 10-column dataset and the job completed successfully. Are there any restrictions on the number of columns Spark can process, or any configuration settings needed to increase it?
Please share the error message from the failed job.
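For a sense of scale while the error is pending, here is a rough back-of-the-envelope sketch based only on the figures quoted in the question. It assumes that 400000 is the row count, 100000 is the column count, and that 1 GB means 10**9 bytes; none of these are confirmed by the original post.

```python
# Rough sizing from the figures quoted in the question.
# Assumptions (not stated in the original post): 400000 is the row count,
# 100000 is the column count, and 1 GB = 10**9 bytes.
ROWS = 400_000
COLUMNS = 100_000
FILE_BYTES = 500 * 10**9

# Average size of one CSV line, and of one field (delimiter included).
bytes_per_row = FILE_BYTES / ROWS
bytes_per_cell = bytes_per_row / COLUMNS

print(f"cells:          {ROWS * COLUMNS:,}")
print(f"bytes per row:  {bytes_per_row:,.0f}")
print(f"bytes per cell: {bytes_per_cell:.1f}")
```

Under those assumptions each row averages about 1.25 MB, which is a very different workload from the 10-column test. This only describes the shape of the data; it does not by itself explain the failure.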