Created 11-19-2016 11:19 PM
I have 2 data sets and the ask is to partial match of model in DS1 to title in DS2. I want to do this in spark/scala, can somebody help how to do the partial match
{"product_name":"Samsung_TL220","manufacturer":"Samsung","model":"TL220","announced-date":"2009-08-12T20:00:00.000-04:00"} {"title":"Genuine Samsung TL220 Galaxy ","manufacturer":"Samsung","currency":"CAD","price":"199.96"}
I created 2 dataframes and then registered as temp table and tried this but no luck
spark.sql("select * from list,prd where list.manufacturer = prd.manufacturer and list.title like '%'+prd.model+'%' ").show()
Created 11-22-2016 03:17 PM
You can use an User Defined Function
prd.show(10, false) +--------+------------+------+-----------------------------+ |currency|manufacturer|price |title | +--------+------------+------+-----------------------------+ |CAD |Samsung |199.96|Genuine Samsung TL220 Galaxy | |CAD |Apple |700.00|Apple iPhone 7 128GB | +--------+------------+------+-----------------------------+
list.show(10, false) +-----------------------------+------------+-----------+-------------+ |announced-date |manufacturer|model |product_name | +-----------------------------+------------+-----------+-------------+ |2009-08-12T20:00:00.000-04:00|Samsung |TL220 |Samsung_TL220| |2016-11-12T20:00:00.000-04:00|Apple |iPhone7_128|iPhone-7_128 | |2016-11-12T20:00:00.000-04:00|Apple |iPhone7_256|iPhone-7_256 | +-----------------------------+------------+-----------+-------------+
In my example the fuzzy match splits product_name and and only if each part is contained in title, the join will be done.: part
val fuzzyMatch = udf((val1: String, val2: String) => val1.split("[-_\\s]+") .map(val2 contains _) .reduce(_ & _) ) list.join(prd, fuzzyMatch(list("product_name"), prd("title"))) .show(20, false) +-----------------------------+------------+-----------+-------------+--------+------------+------+-----------------------------+ |announced-date |manufacturer|model |product_name |currency|manufacturer|price |title | +-----------------------------+------------+-----------+-------------+--------+------------+------+-----------------------------+ |2009-08-12T20:00:00.000-04:00|Samsung |TL220 |Samsung_TL220|CAD |Samsung |199.96|Genuine Samsung TL220 Galaxy | |2016-11-12T20:00:00.000-04:00|Apple |iPhone7_128|iPhone-7_128 |CAD |Apple |700.00|Apple iPhone 7 128GB | +-----------------------------+------------+-----------+-------------+--------+------------+------+-----------------------------+
The logic in the udf can be as you like (however it will have impact on performance)
Created 11-28-2016 04:19 PM
Thanks a lot Walter, this solution is perfect ....