Created 09-28-2016 11:53 AM
Hi experts,
val transactions = sc.textFile("DATA")

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = transactions
  .map(_.split(","))
  .flatMap(xs =>
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++
     xs.combinations(4) ++ xs.combinations(5))
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L)))
  .reduceByKey(_ + _)
  .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)

results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
}
Does anyone know if I'm missing some parameterization?
Created 09-28-2016 04:45 PM
@Pedro Rodgers Three things jump out at me:
1. I re-ran your code with a .filter(_._2 > 2) added after the .reduceByKey(_ + _) (see the sketch below).
2. I also lowered the confidence to 0.6. With those two changes I get 20 or so rules, with confidence varying between 0.6 and 1.0.
3. I suspect that if you carefully go through the results you were getting before, you'll see the model was just learning a 1-to-1 mapping between input and output, so the confidence of 1.0 is justified, but the generalization of the model is bad.
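Roughly where those changes land in your snippet (a sketch of the modification, assuming the same transactions RDD as above; not the exact code I ran):

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = transactions
  .map(_.split(","))
  .flatMap(xs =>
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++
     xs.combinations(4) ++ xs.combinations(5))
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L)))
  .reduceByKey(_ + _)
  .filter(_._2 > 2)   // new: drop itemsets that occur 2 times or fewer
  .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

val ar = new AssociationRules().setMinConfidence(0.6)   // lowered from 0.8
val results = ar.run(freqItemsets)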
Created 09-28-2016 11:33 PM
@Pedro Rodgers Sorry, I meant to say to put the .filter() after the .reduceByKey(); I've edited it into the original answer, so it should run now. Yes, it filters out occurrences with counts less than or equal to 2. If your data/training time isn't too big, you can probably tune that threshold and your confidence level empirically with a grid search (see the sketch below).
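By "grid search" I just mean brute-forcing a few combinations; something like this sketch (the threshold and confidence grids are arbitrary, and counting the resulting rules is only a stand-in for whatever evaluation you'd actually use):

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// Same itemset counting as the earlier snippet, cached so it is only
// computed once across the whole grid.
val itemsetCounts = transactions
  .map(_.split(","))
  .flatMap(xs =>
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++
     xs.combinations(4) ++ xs.combinations(5))
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L)))
  .reduceByKey(_ + _)
  .cache()

val minCounts = Seq(2L, 5L, 10L)        // arbitrary grid of count thresholds
val minConfs  = Seq(0.5, 0.6, 0.7, 0.8) // arbitrary grid of confidence levels

for (minCount <- minCounts; minConf <- minConfs) {
  val freqItemsets = itemsetCounts
    .filter(_._2 > minCount)
    .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }
  val nRules = new AssociationRules().setMinConfidence(minConf).run(freqItemsets).count()
  println(s"minCount=$minCount minConf=$minConf -> $nRules rules")
}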
Created 09-29-2016 09:37 AM
@jfrazee The usual approach is to reduce/remove products with very few occurrences, right? Is it reasonable to think about eliminating products that appear in only 20% of all transactions?
Created 09-29-2016 03:47 PM
That percentage will certainly vary by domain, so I don't know what "normal" would be. I will note that to do this on a large data set you'll need a step in your job to approximate where the cutoff should be, but that's easy enough using the sampling methods exposed on the RDD.
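For instance, something like this sketch, which estimates per-item counts from a sample and derives a cutoff from them (the 10% sample fraction and the mean as the cutoff statistic are just placeholders):

// Sketch: approximate a per-item count cutoff from a sample of the data.
val sampledItemCounts = transactions
  .sample(withReplacement = false, fraction = 0.1, seed = 42L)
  .flatMap(_.split(","))
  .map(item => (item, 1L))
  .reduceByKey(_ + _)
  .cache()

// Mean count over the sample; a percentile over the collected counts would work too.
val cutoff = sampledItemCounts.values.mean()

// Items that clear the estimated cutoff, kept as a set for later filtering.
val frequentItems = sampledItemCounts.filter(_._2 > cutoff).keys.collect().toSet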
Created 09-29-2016 04:34 PM
@jfrazee But I can define the cutoff as the values that are higher than the average, right?
Created 09-29-2016 05:07 PM
Yes, that's sort of what I had in mind, but it'll still depend on how balanced/imbalanced your data is. There are algorithms for doing this more intelligently too, but I've never looked at how to do them in Spark. It looks like the FPGrowth() classes expose a support proportion, but I can't quite tell what it does if you have, e.g., 10k 1's and 100 items with count > 1. I probably can't take you much further without doing some reading.
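For what it's worth, if you want to experiment with that support parameter, usage looks roughly like this (a sketch; the 0.2 minimum support, i.e. 20% of baskets, and the 0.6 confidence are just illustrative values):

import org.apache.spark.mllib.fpm.FPGrowth

// FPGrowth expects each basket as an array of distinct items.
val baskets = transactions.map(_.split(",").distinct).cache()

val model = new FPGrowth()
  .setMinSupport(0.2)    // itemsets must appear in at least 20% of baskets
  .setNumPartitions(10)
  .run(baskets)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

model.generateAssociationRules(0.6).collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
}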