- last edited on
Today I get cloudera/quickstart docker image from https://hub.docker.com/r/cloudera/quickstart/ run it and go to tutorial http://localhost:80/#/tutorial/getting_started .
In the end of the Tutorial Exercise 3 here results of the spark script that retrive which products are most commonly purchased together.
As you can see it's pair of (Nike Men's Dri-FIT Victory Golf Polo,Perfect Fitness Perfect Rip Deck) whith 67876 orders.
I try to compare it with sql query result, that done the same:
select o1.order_item_product_id, o2.order_item_product_id, count(*) cnt
from order_items o1
join order_items o2
on o2.order_item_order_id = o1.order_item_order_id
and o2.order_item_product_id > o1.order_item_product_id
group by o1.order_item_product_id, o2.order_item_product_id
order by 3 desc
order_item_product_id order_item_product_id cnt
365 403 8375
365 502 7963
365 1014 7424
403 502 7294
403 1014 6607
365 1004 6597
502 1014 6323
403 1004 6055
365 1073 5761
502 1004 5689
select product_id, product_name
where product_id in(365, 403);
365 Perfect Fitness Perfect Rip Deck
403 Nike Men's CJ Elite 2 TD Football Cleat
Here i have another pair of products (Perfect Fitness Perfect Rip Deck, Nike Men's CJ Elite 2 TD Football Cleat) and 8375 orders with it.
So there is different results, when I expecded the same.
Maybe I have a mistake in sql query or spark script have not the same input rows, but I see now that 5-th row of spark result have the same product:
(39314,(Perfect Fitness Perfect Rip Deck,Perfect Fitness Perfect Rip Deck))
So, it's not a pair of products here, it's the same product.
My question is - which products are most commonly purchased together? Is it from sql query or from spark (and maybe there is a mistake in a spark script)?
spark script: (67876,(Nike Men's Dri-FIT Victory Golf Polo,Perfect Fitness Perfect Rip Deck))
sql script: (8375,(Nike Men's CJ Elite 2 TD Football Cleat,Perfect Fitness Perfect Rip Deck))