Spark MLlib - Frequent Pattern Mining - strange results

Rising Star

Hi experts,

I have attached a sample of my dataset (sample.txt) to this post, and I am trying to extract some association rules from it using Spark MLlib:
  val transactions = sc.textFile("DATA")

  import org.apache.spark.mllib.fpm.AssociationRules
  import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

  val freqItemsets = transactions.map(_.split(","))
    .flatMap(xs => (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).filter(_.nonEmpty).map(x => (x.toList, 1L)))
    .reduceByKey(_ + _)
    .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

  val ar = new AssociationRules().setMinConfidence(0.8)

  val results = ar.run(freqItemsets)
  results.collect().foreach { rule =>
    println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
  }

However, my code returns a dozen rules, all with confidence equal to 1, which makes little sense!

Does anyone know if I am missing some parameterization?
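
For context, Spark's AssociationRules computes the confidence of a rule as freq(antecedent ++ consequent) / freq(antecedent), so any itemset that is counted only once will always produce confidence-1 rules. A minimal sketch (made-up itemsets, not the attached sample.txt) showing the effect:

  // Toy illustration: with no minimum-count filtering, itemsets seen exactly once
  // yield rules with confidence 1/1 = 1.0.
  import org.apache.spark.mllib.fpm.AssociationRules
  import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

  val toyItemsets = sc.parallelize(Seq(
    new FreqItemset(Array("a"), 1L),      // "a" appears once
    new FreqItemset(Array("a", "b"), 1L)  // "a,b" appears once
  ))

  // Prints the rule [a=>b] with confidence 1.0
  new AssociationRules().setMinConfidence(0.8).run(toyItemsets).collect().foreach { rule =>
    println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
  }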

7 REPLIES


Rising Star

Hi jfrazee, many thanks for your response 🙂 I have some questions about this:
  1) Is the structure of my data (each line corresponds to a set of product IDs) correct for this algorithm?
  2) Does ".filter(_._2 > 2)" filter out the products whose occurrence count is smaller than 2?
  3) When I submit

  val freqItemsets = transactions.map(_.split(","))
    .flatMap(xs => (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).filter(_.nonEmpty).filter(_._2 > 2).map(x => (x.toList, 1L)))
    .reduceByKey(_ + _)
    .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

  I get the following error:

  <console>:31: error: value _2 is not a member of Array[String]

  Do you know how to solve it? Many thanks for your help and for the explanation of the association rules algorithm 🙂 And sorry for all the questions.


@Pedro Rodgers Sorry, I meant to say to put the .filter() after the .reduceByKey(). I've edited it in the original answer; it should run now. Yes, it's filtering out/eliminating occurrences with counts less than or equal to 2. If your data/training time isn't too big, you can probably tune that threshold and your confidence level empirically using a grid search.
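
A sketch of the pipeline with the filter moved after the .reduceByKey(), as described above (the combination sizes, the count threshold of 2, and the 0.8 confidence simply follow the values used in this thread):

  import org.apache.spark.mllib.fpm.AssociationRules
  import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

  val transactions = sc.textFile("DATA")

  val freqItemsets = transactions.map(_.split(","))
    .flatMap(xs => (1 to 5).flatMap(k => xs.combinations(k)).map(x => (x.toList, 1L)))
    .reduceByKey(_ + _)
    .filter(_._2 > 2)   // the count filter only makes sense once the (itemset, count) pairs exist
    .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

  val rules = new AssociationRules().setMinConfidence(0.8).run(freqItemsets)
  rules.collect().foreach { rule =>
    println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
  }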

Rising Star

@jfrazee The normal practice is to reduce/remove products with few occurrences, right? Is it reasonable to think about eliminating the products that appear in only 20% of all transactions?


That percentage will certainly vary by domain, so I don't know what normal will be. I will note that to do that on a large data set you'll need a step in your job to approximate where the cutoff is, but that's easy enough using the sampling methods exposed on the RDD.
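
For example, one way to approximate which products clear a 20%-of-transactions cutoff on a sample, assuming the same comma-separated transaction file as above (the 10% sample fraction, the seed, and the 0.2 cutoff are arbitrary illustrative values):

  // Estimate, on a sample, the fraction of transactions each product appears in,
  // then keep only the products above the chosen cutoff.
  val sample = sc.textFile("DATA")
    .sample(withReplacement = false, fraction = 0.1, seed = 42L)
    .cache()
  val nSampled = sample.count()

  val itemFractions = sample
    .flatMap(_.split(",").distinct)          // count each product once per transaction
    .map(item => (item, 1L))
    .reduceByKey(_ + _)
    .mapValues(_.toDouble / nSampled)

  val keepItems = itemFractions.filter(_._2 >= 0.2).keys.collect().toSet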

Rising Star

@jfrazee But I can define the cut-off as the values that are higher than the average, right?


Yes. That's sort of what I had in mind, but it'll still depend on how balanced/imbalanced your data is. There are algorithms for doing this more intelligently too, but I've never looked at how to do them in Spark. It looks like the FPGrowth() classes expose a support proportion, but I can't quite tell what it does if you have, e.g., 10k items with a count of 1 and 100 items with a count > 1. I probably can't take you much further without doing some reading.
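
For reference, a minimal sketch of the FPGrowth API mentioned above, where setMinSupport takes a proportion of transactions (the 0.2 only mirrors the 20% figure discussed earlier, and the partition count is arbitrary):

  import org.apache.spark.mllib.fpm.FPGrowth

  // FPGrowth does the frequent-itemset mining itself; items must be unique
  // within each transaction, hence the .distinct.
  val txns = sc.textFile("DATA").map(_.split(",").distinct)

  val model = new FPGrowth()
    .setMinSupport(0.2)     // keep itemsets appearing in at least 20% of transactions
    .setNumPartitions(10)
    .run(txns)

  model.generateAssociationRules(0.8).collect().foreach { rule =>
    println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
  }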