Support Questions
Find answers, ask questions, and share your expertise

Spark results in quick-start tutorial

Highlighted

Spark results in quick-start tutorial

New Contributor

Hi.

Today I get cloudera/quickstart docker image from https://hub.docker.com/r/cloudera/quickstart/ run it and go to tutorial http://localhost:80/#/tutorial/getting_started .

In the end of the Tutorial Exercise 3 here results of the spark script that retrive which products are most commonly purchased together.

spark_shell_results.png

 

As you can see it's pair of (Nike Men's Dri-FIT Victory Golf Polo,Perfect Fitness Perfect Rip Deck) whith 67876 orders.

I try to compare it with sql query result, that done the same:

 

 

select o1.order_item_product_id, o2.order_item_product_id, count(*) cnt
from order_items o1
   join order_items o2
      on o2.order_item_order_id = o1.order_item_order_id
         and o2.order_item_product_id > o1.order_item_product_id
group by o1.order_item_product_id, o2.order_item_product_id
order by 3 desc
limit 10;
/* Result:
order_item_product_id	order_item_product_id	cnt
365	403	8375
365	502	7963
365	1014	7424
403	502	7294
403	1014	6607
365	1004	6597
502	1014	6323
403	1004	6055
365	1073	5761
502	1004	5689
*/


select product_id, product_name
from products
where product_id in(365, 403);
/*
product_id	product_name
365	Perfect Fitness Perfect Rip Deck
403	Nike Men's CJ Elite 2 TD Football Cleat
*/

 

 

 

Here i have another pair of products (Perfect Fitness Perfect Rip Deck, Nike Men's CJ Elite 2 TD Football Cleat) and 8375 orders with it.

So there is different results, when I expecded the same.

 

Maybe I have a mistake in sql query or spark script have not the same input rows, but I see now that 5-th row of spark result have the same product:

 

 

(39314,(Perfect Fitness Perfect Rip Deck,Perfect Fitness Perfect Rip Deck))

 

 

So, it's not a pair of products here, it's the same product.

 

My question is - which products are most commonly purchased together? Is it from sql query or from spark (and maybe there is a mistake in a spark script)?

 

spark script: (67876,(Nike Men's Dri-FIT Victory Golf Polo,Perfect Fitness Perfect Rip Deck))
sql script:   (8375,(Nike Men's CJ Elite 2 TD Football Cleat,Perfect Fitness Perfect Rip Deck))