Using CDH 5.16.1, Spark 2.2
Depending on the input parameters (input data and minSupport) passed to org.apache.spark.ml.fpm.FPGrowth,
a run can turn "greedy": a huge number of frequent patterns is generated and has to be handled.
With Spark on YARN, this leads to heavy disk usage: shuffle temporary files pile up in the YARN application cache (the appcache directories under the NodeManager local dirs).
The amount of disk space consumed can be problematic.
Is there any way, in terms of configuration, to limit this disk space usage?
As a basic user of the FPGrowth API:
- I cannot call rdd.unpersist() on the intermediate RDDs inside FPGrowth()
- I cannot delete intermediate shuffle files that may still be needed by FPGrowth()
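For context, the sketch below shows the shuffle-related settings that exist in Spark 2.2 and that, as far as I understand, can influence the on-disk shuffle footprint. The jar name and values are placeholders; the compression flags already default to true, and none of these settings will remove shuffle files that FPGrowth still needs.

```shell
# Hedged sketch: shuffle-related Spark 2.2 settings; jar name and values are placeholders.
# spark.shuffle.compress / spark.shuffle.spill.compress: compress shuffle map output
# and spilled data (both default to true, so these reduce size, not file count).
# spark.cleaner.periodicGC.interval: trigger a JVM GC periodically so the ContextCleaner
# can remove shuffle files whose RDDs are no longer referenced (default is 30min).
spark-submit \
  --master yarn \
  --conf spark.shuffle.compress=true \
  --conf spark.shuffle.spill.compress=true \
  --conf spark.cleaner.periodicGC.interval=15min \
  fpgrowth-job.jar
```

Beyond that, the only lever I see at the API level is raising minSupport itself, which shrinks the set of frequent patterns and therefore the shuffle volume.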