
limit disk space used by org.apache.spark.ml.fpm.FPGrowth ?


New Contributor

Using CDH 5.16.1, Spark 2.2

 

Depending on the input parameters (the input data and minSupport) of org.apache.spark.ml.fpm.FPGrowth,
one can end up running a "greedy" Frequent Pattern Mining job: "greedy" because a huge number of patterns has to be handled.
With Spark on YARN, this leads to heavy disk usage: shuffle temporary files pile up in the YARN application cache.
This disk space consumption can be problematic.
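
For context, here is a minimal sketch of the kind of job in question (the transactions and the minSupport value are purely illustrative):

import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fpgrowth-example").getOrCreate()

// Illustrative input: a DataFrame with an array-of-strings "items" column.
val transactions = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b")),
  (2, Array("a", "c"))
)).toDF("id", "items")

// A low minSupport is what makes the mining "greedy": many more itemsets
// survive the pruning, and the intermediate shuffle data grows accordingly.
val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.01)

val model = fpgrowth.fit(transactions)
model.freqItemsets.show()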

 

Is there any configuration option to limit this disk space usage?

2 REPLIES

Re: limit disk space used by org.apache.spark.ml.fpm.FPGrowth ?

Expert Contributor

Hello @dida,

 

Nice use of the word "greedy" :)

 

You may want to use rdd.unpersist(). This will also help intermediate files get removed.
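
A minimal sketch of that idea, assuming the caller has cached the input DataFrame before fitting (transactions and fpgrowth are the illustrative names from the question's setup, not anything FPGrowth-specific):

// Cache the input, fit the model, then release the cached blocks so the
// executors can clean up the associated storage.
transactions.cache()
val model = fpgrowth.fit(transactions)
transactions.unpersist(blocking = true)  // eagerly free the cached blocks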

Re: limit disk space used by org.apache.spark.ml.fpm.FPGrowth ?

New Contributor

As a basic user of the FPGrowth API:

- I cannot call rdd.unpersist() inside FPGrowth()

- I cannot delete intermediate files that could still be needed by FPGrowth()
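
Given that, the only levers reachable from outside the algorithm seem to be the estimator's own parameters. A sketch with purely illustrative values (not recommendations):

import org.apache.spark.ml.fpm.FPGrowth

// Raising minSupport prunes candidate itemsets earlier, which directly
// shrinks the intermediate shuffle data; numPartitions is the parallelism
// parameter of parallel FP-growth and spreads that data across tasks.
val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.1)
  .setNumPartitions(200)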