Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

partial match of text among 2 datasets

partial match of text among 2 datasets

New Contributor

I have 2 data sets and the ask is to partial match of model in DS1 to title in DS2. I want to do this in spark/scala, can somebody help how to do the partial match

{"product_name":"Samsung_TL220","manufacturer":"Samsung","model":"TL220","announced-date":"2009-08-12T20:00:00.000-04:00"}  

{"title":"Genuine Samsung TL220 Galaxy ","manufacturer":"Samsung","currency":"CAD","price":"199.96"}

I created 2 dataframes and then registered as temp table and tried this but no luck

 spark.sql("select * from list,prd where list.manufacturer = prd.manufacturer and list.title like '%'+prd.model+'%' ").show()
2 REPLIES 2

Re: partial match of text among 2 datasets

You can use an User Defined Function

prd.show(10, false)

+--------+------------+------+-----------------------------+
|currency|manufacturer|price |title                        |
+--------+------------+------+-----------------------------+
|CAD     |Samsung     |199.96|Genuine Samsung TL220 Galaxy |
|CAD     |Apple       |700.00|Apple iPhone 7 128GB         |
+--------+------------+------+-----------------------------+
list.show(10, false)

+-----------------------------+------------+-----------+-------------+
|announced-date               |manufacturer|model      |product_name |
+-----------------------------+------------+-----------+-------------+
|2009-08-12T20:00:00.000-04:00|Samsung     |TL220      |Samsung_TL220|
|2016-11-12T20:00:00.000-04:00|Apple       |iPhone7_128|iPhone-7_128 |
|2016-11-12T20:00:00.000-04:00|Apple       |iPhone7_256|iPhone-7_256 |
+-----------------------------+------------+-----------+-------------+

In my example the fuzzy match splits product_name and and only if each part is contained in title, the join will be done.: part

val fuzzyMatch = udf((val1: String, val2: String) => 
    val1.split("[-_\\s]+")
        .map(val2 contains _)
       .reduce(_ & _)
)
list.join(prd, fuzzyMatch(list("product_name"), prd("title")))
    .show(20, false)

+-----------------------------+------------+-----------+-------------+--------+------------+------+-----------------------------+
|announced-date               |manufacturer|model      |product_name |currency|manufacturer|price |title                        |
+-----------------------------+------------+-----------+-------------+--------+------------+------+-----------------------------+
|2009-08-12T20:00:00.000-04:00|Samsung     |TL220      |Samsung_TL220|CAD     |Samsung     |199.96|Genuine Samsung TL220 Galaxy |
|2016-11-12T20:00:00.000-04:00|Apple       |iPhone7_128|iPhone-7_128 |CAD     |Apple       |700.00|Apple iPhone 7 128GB         |
+-----------------------------+------------+-----------+-------------+--------+------------+------+-----------------------------+

The logic in the udf can be as you like (however it will have impact on performance)

Highlighted

Re: partial match of text among 2 datasets

New Contributor

Thanks a lot Walter, this solution is perfect ....