
Limit disk space used by FPGrowth?


Using CDH 5.16.1, Spark 2.2


Depending on the input parameters (the input data and minSupport) passed to FPGrowth,
one can end up running a "greedy" Frequent Pattern Mining — "greedy" because a huge number of patterns has to be handled.
With Spark on YARN, this leads to heavy disk usage: shuffle temporary files pile up in the YARN application cache.
This disk consumption can be problematic.


Is there any configuration setting to limit this disk space usage?


Re: Limit disk space used by FPGrowth?

Expert Contributor

Hello @dida,


Nice use of the word "greedy" :)


You may want to call rdd.unpersist(). This will also help intermediate files get removed.
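As a minimal sketch of that suggestion (the RDD name and the parameter values here are assumptions, not from the thread): cache the input, build the model, then unpersist once the frequent itemsets have been materialized.

```scala
import org.apache.spark.mllib.fpm.FPGrowth

// transactions is a hypothetical RDD[Array[String]] of itemsets, loaded elsewhere;
// the minSupport and partition values are illustrative only
transactions.cache()                  // FPGrowth makes several passes over the input
val model = new FPGrowth()
  .setMinSupport(0.01)
  .setNumPartitions(8)
  .run(transactions)
val patterns = model.freqItemsets.collect()
transactions.unpersist()              // release the cached blocks once the model is built
```

Note that unpersist() only releases blocks the caller itself cached; it does not directly delete shuffle files that Spark creates internally.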

Re: Limit disk space used by FPGrowth?


As a basic user of the FPGrowth API:

- I cannot call rdd.unpersist() inside FPGrowth()

- I cannot delete intermediate files that could still be needed by FPGrowth()
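For completeness, one configuration-level mitigation (a sketch, not mentioned in the thread): Spark's ContextCleaner deletes shuffle files once the corresponding shuffle objects are garbage-collected on the driver, and lowering spark.cleaner.periodicGC.interval (default 30min in Spark 2.x) forces that GC to happen more often, so internal shuffle data from already-completed FPGrowth stages can be reclaimed sooner. A spark-defaults style fragment:

```
# Trigger a driver GC more often so the ContextCleaner can delete
# shuffle files of no-longer-referenced RDDs sooner (default: 30min)
spark.cleaner.periodicGC.interval  10min
```

This does not impose a hard quota; on the YARN side, the NodeManager disk-health-checker utilization threshold only stops scheduling containers on a nearly full disk rather than limiting a single application's usage.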
