New Contributor
Posts: 1
Registered: 12-26-2017

Processing large CSV with 400000 rows and 100000 columns

I have a CSV dataset with 400000 rows and 100000 columns, about 500 GB in size.


I have done the following:

df = spark.read.csv('largefile.csv', header=True, maxColumns=100000)

 

print(df.count()) 

and saved it as spark_test.py.
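
For completeness, spark_test.py is essentially the sketch below. The explicit SparkSession creation is needed because spark is only predefined in the pyspark shell, not in a script run through spark-submit; the application name is just a placeholder.

from pyspark.sql import SparkSession

# spark-submit does not provide a SparkSession for a Python script,
# so build one explicitly ("large_csv_count" is just a placeholder name).
spark = SparkSession.builder.appName("large_csv_count").getOrCreate()

# maxColumns raises the CSV parser's column limit; the default (20480)
# is far below the 100000 columns in this file.
df = spark.read.csv('largefile.csv', header=True, maxColumns=100000)

print(df.count())

spark.stop()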


In the terminal:

 

spark2-submit --master yarn --deploy-mode cluster spark_test.py

The Spark job starts and I can track it in the Spark UI; it keeps running, but after 30 minutes or so it fails.

 

For testing purposes I tried the same steps with a 10-column dataset and the job completed successfully. Are there any restrictions on the number of columns Spark can process, or configuration settings to increase that limit?

Contributor
Posts: 52
Registered: 01-08-2016

Re: Processing large CSV with 400000 rows and 100000 columns

Hi,

 

Please share the error message from the failure.
