Groceries - Supervised algorithm proposal in Spark 1.6

Contributor

Hi experts,

I have a dataset with information about customers and the products they buy from a set of supermarkets:

- Customer_ID

- Transaction_Number

- Store_ID

- Product_Name

- Transaction_Date

- Product_Value

Which supervised algorithm would you recommend for building a use case in Spark 1.6?

I'm very new to this topic 🙂

Many thanks!!

1 ACCEPTED SOLUTION


@Johnny Fugers

This could go a few different ways depending on the business objective. Is there a reason why you are asking about a supervised approach? If you are open to any ML approach, here's what I'd recommend...

Recommendation Engine: The first thing that comes to mind is to use a recommendation engine. Spark 1.6 includes a collaborative filtering algorithm, which makes recommendations based on a user's behavior and their similarity to other users.

Frequent Patterns: Another interesting option would be to perform frequent pattern mining, which provides a way to identify frequent item sets and association rules.
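To make the frequent item set idea concrete, here is a minimal sketch in plain Python (not the Spark API). It groups rows by Transaction_Number and counts how often each pair of products shows up in the same basket. The sample rows, the `build_baskets` and `frequent_pairs` helpers, and the 0.5 support threshold are all made up for illustration. On real data you would more likely reach for MLlib's FP-growth implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows: (Transaction_Number, Product_Name)
rows = [
    (1, "bread"), (1, "milk"), (1, "cereal"),
    (2, "bread"), (2, "milk"),
    (3, "milk"), (3, "cereal"),
    (4, "bread"), (4, "butter"),
]

def build_baskets(rows):
    """Group product names into one set per transaction."""
    baskets = {}
    for txn, product in rows:
        baskets.setdefault(txn, set()).add(product)
    return baskets

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for product pairs found in at least
    min_support (a fraction between 0 and 1) of all baskets."""
    counts = Counter()
    for items in baskets.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

baskets = build_baskets(rows)
pairs = frequent_pairs(baskets, min_support=0.5)
```

With this toy data, only the bread/milk and cereal/milk pairs reach the threshold. Counting pairs by brute force like this does not scale to large catalogs, which is the problem FP-growth is designed to address.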

Clustering: There are also several clustering algorithms in Spark, which you can leverage to identify similar customer cohorts. You could then measure the total value of these cohorts based on their overall spending, or try to identify who they are based on their purchases (e.g. single adults, elderly, parents, etc.).
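As a rough illustration of the cohort idea, the sketch below sums Product_Value per customer and then runs a tiny one-dimensional k-means (Lloyd's algorithm) over those totals. It is plain Python rather than Spark code, and the sample rows, the helper names, and the choice of k=2 are assumptions for the example. A real segmentation would normally use more features than total spend.

```python
from collections import defaultdict

# Hypothetical sample rows: (Customer_ID, Product_Value)
purchases = [
    ("c1", 5.0), ("c1", 7.0),
    ("c2", 6.0), ("c2", 4.0),
    ("c3", 80.0), ("c3", 95.0),
    ("c4", 90.0), ("c4", 70.0),
]

def total_spend(purchases):
    """Sum the purchase values per customer."""
    totals = defaultdict(float)
    for customer, value in purchases:
        totals[customer] += value
    return dict(totals)

def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars, for k >= 2.
    Centroids start evenly spaced between the min and max value."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

totals = total_spend(purchases)
centroids = kmeans_1d(list(totals.values()), k=2)

# Assign each customer to the index of the nearest centroid.
cohort = {
    customer: min(range(len(centroids)), key=lambda i: abs(spend - centroids[i]))
    for customer, spend in totals.items()
}
```

Here the low spenders (c1, c2) land in cohort 0 and the high spenders (c3, c4) in cohort 1. The cohort index can later be reused as a customer segment feature.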

You might also want to check out time series analysis / forecasting if you want to predict spending, growth/seasonal patterns, etc.

So most of my recommendations use an unsupervised approach, because that is a good fit for this type of data and for use cases like this one.

If you'd prefer to take a supervised approach, you could always use a decision tree, random forest, or similar and try to predict the purchase value based on store, time of day, day of week, customer, and maybe even throw in a customer segment variable (which you can derive using the clustering approach above 🙂 ).
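Before training any tree model, the raw rows need to be turned into one feature row per transaction. The sketch below shows one possible way to do that in plain Python. The timestamp format, the sample rows, the `to_training_rows` helper, and the `segment_of` lookup (customer to cohort index) are all assumptions for illustration, not something taken from the dataset.

```python
from datetime import datetime

# Hypothetical sample rows; the timestamp format is an assumption.
rows = [
    {"Customer_ID": "c1", "Transaction_Number": 1, "Store_ID": "s1",
     "Transaction_Date": "2016-03-05 18:30:00", "Product_Value": 4.5},
    {"Customer_ID": "c1", "Transaction_Number": 1, "Store_ID": "s1",
     "Transaction_Date": "2016-03-05 18:30:00", "Product_Value": 2.5},
    {"Customer_ID": "c2", "Transaction_Number": 2, "Store_ID": "s2",
     "Transaction_Date": "2016-03-07 09:15:00", "Product_Value": 10.0},
]

def to_training_rows(rows, segment_of):
    """Build one (features, label) pair per transaction.
    The label is the total value of the basket."""
    by_txn = {}
    for r in rows:
        ts = datetime.strptime(r["Transaction_Date"], "%Y-%m-%d %H:%M:%S")
        entry = by_txn.setdefault(r["Transaction_Number"], {
            "features": {
                "store": r["Store_ID"],
                "hour": ts.hour,
                "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
                # -1 marks a customer with no known segment.
                "segment": segment_of.get(r["Customer_ID"], -1),
            },
            "label": 0.0,
        })
        entry["label"] += r["Product_Value"]
    return [(e["features"], e["label"]) for _, e in sorted(by_txn.items())]

training = to_training_rows(rows, segment_of={"c1": 0, "c2": 1})
```

Categorical values such as the store would still need to be encoded as numeric indices before being handed to a decision tree or random forest.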


5 REPLIES

Expert Contributor

A typical use-case for this sort of data would be to recommend items to a customer, based on what similar customers have purchased.

In 2001, Amazon introduced item-based collaborative filtering, and it's still popular today. This short and very accessible IEEE paper describes the technique.

There's a good practical example of collaborative filtering in the Spark docs: https://spark.apache.org/docs/1.6.2/mllib-collaborative-filtering.html
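For intuition, here is a small item-based sketch in plain Python. Note that the Spark page linked above covers ALS matrix factorization, which is a different, model-based flavor of collaborative filtering. This example instead scores unseen items by their cosine similarity to items the customer already bought. The sample rows and the `buyers_by_item`, `cosine`, and `recommend` helpers are hypothetical.

```python
from math import sqrt

# Hypothetical sample rows: (Customer_ID, Product_Name)
purchases = [
    ("c1", "bread"), ("c1", "milk"),
    ("c2", "bread"), ("c2", "milk"), ("c2", "cereal"),
    ("c3", "milk"), ("c3", "cereal"),
    ("c4", "bread"),
]

def buyers_by_item(purchases):
    """Map each product to the set of customers who bought it."""
    buyers = {}
    for customer, product in purchases:
        buyers.setdefault(product, set()).add(customer)
    return buyers

def cosine(a, b):
    """Cosine similarity of two binary vectors represented as sets."""
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(customer, purchases, top_n=3):
    """Rank items the customer has not bought by their summed
    similarity to the items the customer has bought."""
    buyers = buyers_by_item(purchases)
    owned = {item for item, who in buyers.items() if customer in who}
    scores = {}
    for candidate, who in buyers.items():
        if candidate in owned:
            continue
        scores[candidate] = sum(cosine(who, buyers[item]) for item in owned)
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

suggestions = recommend("c4", purchases)
```

Customer c4 has only bought bread, so milk (bought by two of the three bread buyers) ranks ahead of cereal.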

Contributor

@Alex Woolford thanks :). Do you recommend combining collaborative filtering with association rules? Imagine that I predict that customer A will buy product 1, and I discover that people who buy product 1 also buy product 2, so I can recommend products 1 and 2 to customer A. Do you think this is a good approach?

Contributor

@Dan Zaratsian many thanks for your response :) When you say "predict purchase", do you mean something like "person 1 will buy products A and B"?


@Johnny Fugers In this context, "predicting purchase" could mean a few different things (and there are several ways we could go about it). For example, if you are interested in predicting whether person 1 will purchase product A, then you can look at their purchase history and/or you can look at similar purchases across a segment of customers.

In the first scenario, you are basically working with probabilities (e.g. if I buy peanut butter every time I go to the store, then there's a high probability that I'll buy it on my next visit). Your predictive model should also take into consideration other factors such as time of day, month (seasonality), store ID, etc. If you create a model for every customer, this could get expensive from a compute standpoint, which is why many organizations segment customers into groups/cohorts based on behavioral similarities. Predictive models are then built against these segments.
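The simplest version of the first scenario is just a frequency estimate: the share of a customer's past visits that included the product. The sketch below is plain Python with made-up rows and a hypothetical `purchase_rate` helper. It ignores the other factors mentioned above (time of day, seasonality, store), so treat it as a baseline rather than a predictive model.

```python
# Hypothetical sample rows: (Customer_ID, Transaction_Number, Product_Name)
rows = [
    ("c1", 1, "peanut butter"), ("c1", 1, "bread"),
    ("c1", 2, "peanut butter"),
    ("c1", 3, "milk"),
    ("c1", 4, "peanut butter"), ("c1", 4, "milk"),
    ("c2", 5, "milk"),
]

def purchase_rate(rows, customer, product):
    """Fraction of the customer's visits that included the product.
    Returns None when the customer has no recorded visits."""
    visits = {}
    for c, txn, p in rows:
        if c == customer:
            visits.setdefault(txn, set()).add(p)
    if not visits:
        return None
    hits = sum(1 for items in visits.values() if product in items)
    return hits / len(visits)

rate = purchase_rate(rows, "c1", "peanut butter")
```

In this toy data, c1 bought peanut butter on three of four visits, so the estimate is 0.75.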

A second approach would be to use market basket analysis. For example, when customer A purchases cereal, how likely are they to also purchase milk? This factors in purchases across a segment of customers to look for "baskets" of similar purchases.
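The cereal and milk question can be expressed with the usual association rule metrics: support, confidence, and lift. Here is a minimal plain Python sketch. The baskets and the `rule_metrics` helper are made up for illustration.

```python
# Hypothetical baskets: one set of product names per transaction.
baskets = [
    {"cereal", "milk"},
    {"cereal", "milk"},
    {"bread", "milk"},
    {"bread"},
]

def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.
    Returns None if either item never appears."""
    n = len(baskets)
    with_a = sum(1 for b in baskets if antecedent in b)
    with_c = sum(1 for b in baskets if consequent in b)
    with_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    if n == 0 or with_a == 0 or with_c == 0:
        return None
    # Confidence: of the baskets with the antecedent, how many have both?
    confidence = with_both / with_a
    # Lift: confidence relative to how common the consequent is overall.
    lift = confidence / (with_c / n)
    return {"support": with_both / n, "confidence": confidence, "lift": lift}

metrics = rule_metrics(baskets, "cereal", "milk")
```

A lift above 1 suggests that cereal buyers pick up milk more often than shoppers in general. A lift near 1 means the two purchases look independent.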