I need to create a feature dataset from a huge Kaggle dataset. There are 6 to 7 large files, the biggest around 7 GB. I am only using standard operations: reading the data from Parquet files, creating DataFrames, and performing joins as required.
I have a 6-node cluster with sufficient memory. However, when I run the job through the Spark standalone cluster manager with maximum memory and cores, the job fails. The same steps work fine in spark-shell, however.
I have searched extensively for which configurations might be wrong. I have used the same configurations that I used with spark-shell, and tried both cache and persist, but I still end up with "No space left on device" errors or the job failing midway. What could be wrong, and why do the same operations run and complete without any fuss in spark-shell? I have not changed a single line of code between the shell and my application.
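For reference, the pipeline is roughly the following sketch (file paths, join keys, and the persist level are placeholders, not my actual schema):

```scala
// Hedged sketch of the job; paths and column names are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("FeatureDataset")
  .getOrCreate()

// Read each large Parquet file into a DataFrame.
val users  = spark.read.parquet("hdfs:///data/users.parquet")
val events = spark.read.parquet("hdfs:///data/events.parquet")

// Join as per the requirement and persist the intermediate result.
val features = users.join(events, Seq("user_id"))
features.persist(StorageLevel.MEMORY_AND_DISK)

// Write the final feature dataset back out.
features.write.parquet("hdfs:///data/features.parquet")
```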
Have you tried using YARN instead of Spark's standalone cluster manager? Try "--master yarn". If you post your command-line arguments, it might help give us some more clues.
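For example, assuming Hadoop/YARN is already set up on the cluster, a submission might look like this (the jar name, main class, and resource sizes below are placeholders, not values from your setup):

```shell
# Submit the same job to YARN instead of the standalone master.
# Class name, jar, and resource numbers are illustrative placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.FeatureJob \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 12 \
  feature-job.jar
```

Running on YARN also changes where shuffle and spill files land, which can matter for "No space left on device" errors.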