Support Questions

Find answers, ask questions, and share your expertise

Spark Mllib - Frequent Pattern Mining - strange results



Explorer

Created 09-28-2016 11:53 AM


Hi experts,

I have attached the dataset sample.txt to this post, and I am trying to extract some association rules from it using Spark MLlib:

```scala
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val transactions = sc.textFile("DATA")

// Explode each transaction into its itemsets of size 1 to 5 and count them
val freqItemsets = transactions.map(_.split(","))
  .flatMap { xs =>
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++
     xs.combinations(4) ++ xs.combinations(5))
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L))
  }
  .reduceByKey(_ + _)
  .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)

results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" +
    rule.consequent.mkString(",") + "]," + rule.confidence)
}
```

However, my code returns a dozen rules with confidence equal to 1, which makes little sense!

Does anyone know if I am missing some parameterization?
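For context on where those confidences of 1 come from: AssociationRules scores a rule X => Y as freq(X union Y) / freq(X), so any antecedent that never appears without its consequent scores exactly 1.0. A minimal plain-Scala sketch of that arithmetic (the toy data is invented here; no Spark needed):

```scala
// Toy transactions: each is one basket of product ids
val transactions = Seq(
  Set("a", "b"),
  Set("a", "b"),
  Set("a", "c")
)

// Number of transactions that contain a given itemset
def freq(itemset: Set[String]): Int =
  transactions.count(t => itemset.subsetOf(t))

// Confidence of the rule x => y: freq(x union y) / freq(x)
def confidence(x: Set[String], y: Set[String]): Double =
  freq(x ++ y).toDouble / freq(x)

// "b" never occurs without "a", so {b} => {a} scores a perfect 1.0
println(confidence(Set("b"), Set("a")))
// "a" occurs three times but only twice with "b": confidence 2/3
println(confidence(Set("a"), Set("b")))
```

With only a handful of observations per pattern, perfect confidences like the first one are expected rather than suspicious.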

1 ACCEPTED SOLUTION

Guru

Created 09-28-2016 04:45 PM

@Pedro Rodgers Three things jump out at me:

- There are a ton of examples in your sample data with just one or two occurrences of the pattern. For this particular algorithm and its usual application they're not going to be very useful.
- Your confidence is quite high, considering the size of the sample data and the evidence for the different patterns.
- The learner came up with 1165 rules for 1185 data points.

I re-ran your code with a .filter(_._2 > 2) after the .reduceByKey(_ + _) and lowered the confidence to 0.6, and I now get 20 or so rules with confidence varying between 0.6 and 1.0.

I suspect that if you carefully go through the results you were getting before, you'll see that the model was just learning a 1-to-1 mapping between input and output, so the confidence of 1.0 is justified, but the generalization of the model is bad.
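To make the placement of the fix concrete, here is the same pipeline shape rendered with plain Scala collections (toy data, no Spark): the count filter sits after the counting step, not inside the flatMap:

```scala
// Toy "transactions", one per line of the input
val transactions = Seq("a,b,c", "a,b,c", "a,b,c", "a,d", "e,f")

// Same shape as the RDD pipeline: explode each transaction into
// itemsets of size 1..5, count them, THEN drop the rare ones
val counts: Map[List[String], Long] = transactions
  .map(_.split(","))
  .flatMap { xs =>
    (1 to 5).flatMap(k => xs.combinations(k))
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L))
  }
  .groupBy(_._1)
  .map { case (xs, pairs) => (xs, pairs.map(_._2).sum) } // ~ reduceByKey(_ + _)

// The filter goes AFTER counting: keep itemsets seen more than twice
val frequent = counts.filter { case (_, cnt) => cnt > 2 }
```

Putting the same filter inside the flatMap is what triggers the "value _2 is not a member of Array[String]" error later in this thread, because at that point the elements are still raw token arrays, not (itemset, count) pairs.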

7 REPLIES 7


Re: Spark Mllib - Frequent Pattern Mining - strange results

Explorer

Created 09-28-2016 10:59 PM

Hi jfrazee,

Many thanks for your response :) I have some questions about this:

1) Is the structure of my data (each line corresponds to a set of product ids) correct for this algorithm?

2) Does the ".filter(_._2 > 2)" filter out the products that have fewer than 2 occurrences?

3) When I submit

```scala
val freqItemsets = transactions.map(_.split(","))
  .flatMap { xs =>
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++
     xs.combinations(4) ++ xs.combinations(5))
      .filter(_.nonEmpty)
      .filter(_._2 > 2)
      .map(x => (x.toList, 1L))
  }
  .reduceByKey(_ + _)
  .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }
```

I get the following error:

**<console>:31: error: value _2 is not a member of Array[String]**

Do you know how to solve it? Many thanks for your help and explanation of the association rules algorithm :) And sorry for these questions.

Re: Spark Mllib - Frequent Pattern Mining - strange results

Guru

Created 09-28-2016 11:33 PM

@Pedro Rodgers Sorry, I meant to put the .filter() after the .reduceByKey(); I've edited it in the original answer. It should run now. Yes, it filters out occurrences with counts less than or equal to 2. If your data/training time isn't too big, you can probably tune that threshold and your confidence level empirically using a grid search.
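The grid search suggested above can be sketched without Spark: enumerate candidate (minCount, minConfidence) pairs and keep whichever a caller-supplied scoring function likes best. The `score` function below is a hypothetical stand-in for whatever evaluation you would actually run (e.g. re-mining the rules at those thresholds and measuring held-out precision):

```scala
// Candidate thresholds to try
val minCounts = Seq(1L, 2L, 3L, 5L)
val minConfidences = Seq(0.5, 0.6, 0.7, 0.8)

// Hypothetical scoring function: in practice this would re-run the
// frequent-pattern job with the given thresholds and score the rules.
// Here it simply prefers minCount = 3 and minConfidence = 0.6.
def score(minCount: Long, minConfidence: Double): Double =
  -math.abs(minCount - 3L) - math.abs(minConfidence - 0.6)

// Exhaustive search over the parameter grid
val (bestCount, bestConf) = (for {
  c  <- minCounts
  mc <- minConfidences
} yield (c, mc)).maxBy { case (c, mc) => score(c, mc) }
```

Because both grids are tiny, exhaustive search is cheap; the expensive part in real use is the repeated mining job inside `score`.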

Re: Spark Mllib - Frequent Pattern Mining - strange results

Explorer

Created 09-29-2016 09:37 AM

@jfrazee The normal practice is to reduce or remove products with few occurrences, right? Is it reasonable to think about eliminating the products that appear in only 20% of all transactions?

Re: Spark Mllib - Frequent Pattern Mining - strange results

Guru

Created 09-29-2016 03:47 PM

That percentage will certainly vary by domain, so I don't know what normal will be. I will note that to do that on a large data set you'll need a step in your job to approximate where the cutoff is, but that's easy enough using the sampling methods exposed on the RDD.
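A plain-Scala sketch of that sampling idea, with made-up data (on a real job you would use RDD.sample with a fixed seed instead of filtering with a local random generator): estimate each item's share of transactions from a sample, then keep items above the 20% threshold discussed above.

```scala
// Made-up dataset: "a" is very common, "b" fairly common, the rest rare
val transactions: Seq[Set[String]] =
  Seq.fill(600)(Set("a", "b")) ++ Seq.fill(300)(Set("a")) ++
    (1 to 100).map(i => Set(s"rare$i"))

// Fixed-seed ~10% sample (stand-in for rdd.sample(false, 0.1, seed))
val rng = new scala.util.Random(42)
val sample = transactions.filter(_ => rng.nextDouble() < 0.1)

// Estimate each item's share of transactions from the sample
val shares: Map[String, Double] = sample
  .flatMap(_.toSeq)
  .groupBy(identity)
  .map { case (item, occurrences) =>
    (item, occurrences.size.toDouble / sample.size)
  }

// Keep only items estimated to appear in at least 20% of transactions
val keep = shares.filter { case (_, share) => share >= 0.2 }.keySet
```

The estimate is noisy near the threshold, which is why this is only good for approximating where the cutoff sits, not for an exact count-based filter.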

Re: Spark Mllib - Frequent Pattern Mining - strange results

Explorer

Created 09-29-2016 04:34 PM

@jfrazee But I can define the cut-off as the values that are higher than the average, right?

Re: Spark Mllib - Frequent Pattern Mining - strange results

Guru

Created 09-29-2016 05:07 PM

Yes. That's sort of what I had in mind, but it'll still depend on how balanced or imbalanced your data is. There are algorithms for doing this more intelligently too, but I've never looked at how to do them in Spark. It looks like the FPGrowth() classes expose a support proportion, but I can't quite tell what it does if you have, e.g., 10k items with a count of 1 and 100 items with a count > 1. I probably can't take you much further without doing some reading.
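The mean-based cut-off from the last two replies can be sketched on toy counts (invented here); as noted above, whether the mean is a sensible threshold depends on how skewed the counts are, since a few very frequent items can pull the average above almost everything else:

```scala
// Toy per-itemset counts, heavily skewed toward a couple of items
val counts: Map[String, Long] =
  Map("a" -> 50L, "b" -> 40L, "c" -> 3L, "d" -> 2L, "e" -> 1L, "f" -> 1L)

// Average count across all itemsets
val mean: Double = counts.values.sum.toDouble / counts.size

// Keep only itemsets whose count exceeds the average
val kept = counts.filter { case (_, cnt) => cnt > mean }
```

Here the mean is about 16, so only "a" and "b" survive; on a long-tailed distribution a percentile cut-off is often a more stable choice than the mean.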
