Support Questions

Find answers, ask questions, and share your expertise

Can I join 2 dataframe with condition in column value?

avatar
Expert Contributor

I have 2 Dataframe and I would like to show the one of the dataframe if my conditions satishfied. I want to match the first column of both the DB and also the condition SEV_LVL='3'. Can I get some guidance or help please

scala> input_file.show()
+-----------+--------+-----+----+-------+
|     ckt_id|location|usage|port|machine|
+-----------+--------+-----+----+-------+
|     ckt_id|location|usage|port|machine|
| AXZCSD21DF|     USA|  2GB| 101|   MAC1|
| ABZCSD21DF|     OTH|  4GB| 101|   MAC2|
| AXZCSD21DF|     USA|  6GB| 101|   MAC4|
| BXZCSD21DF|     USA|  7GB| 101|   MAC6|
| CXZCSD21DF|     IND|  2GB| 101|   MAC9|
| AXZCSD21DF|     USA|  1GB| 101|   MAC0|
| AXZCSD22DF|     IND|  9GB| 101|   MAC3|
|ADZZCSD21DF|     USA|  1GB| 101|   MAC4|
| AXZCSD21DF|     USA|  2GB| 101|   MAC5|
| XZDCSD21DF|     OTH|  2GB| 101|   MAC1|
+-----------+--------+-----+----+-------+

scala> gsam.show()
+-----------+-------+
|    CCKT_NO|SEV_LVL|
+-----------+-------+
| AXZCSD21DF|      1|
| BXZCSD21DF|      1|
| ABZCSD21DF|      3|
| CXZCSD21DF|      2|
| AXZCSD22DF|      2|
| XZDCSD21DF|      3|
|ADZZCSD21DF|      1|
+-----------+-------+





scala> val gsamjoin = gsam.join(input_file,(gsam("CCKT_NO") <=> input_file("ckt_id")));
gsamjoin: org.apache.spark.sql.DataFrame = [CCKT_NO: string, SEV_LVL: decimal(38,0), ckt_id: string, location: string, usage: string, port: string, machine: string]


scala> gsamjoin.show()
+-----------+-------+-----------+--------+-----+----+-------+
|    CCKT_NO|SEV_LVL|     ckt_id|location|usage|port|machine|
+-----------+-------+-----------+--------+-----+----+-------+
| CXZCSD21DF|      2| CXZCSD21DF|     IND|  2GB| 101|   MAC9|
| ABZCSD21DF|      3| ABZCSD21DF|     OTH|  4GB| 101|   MAC2|
| XZDCSD21DF|      3| XZDCSD21DF|     OTH|  2GB| 101|   MAC1|
| AXZCSD22DF|      2| AXZCSD22DF|     IND|  9GB| 101|   MAC3|
|ADZZCSD21DF|      1|ADZZCSD21DF|     USA|  1GB| 101|   MAC4|
| BXZCSD21DF|      1| BXZCSD21DF|     USA|  7GB| 101|   MAC6|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  2GB| 101|   MAC1|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  6GB| 101|   MAC4|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  1GB| 101|   MAC0|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  2GB| 101|   MAC5|
+-----------+-------+-----------+--------+-----+----+-------+
1 ACCEPTED SOLUTION

avatar
Super Collaborator

Definitely possible! Here is some sample code:

gsam.join(input_file, (gsam("CCKT_NO")===input_file("ckt_id")) && (gsam("SEV_LVL") === 3)), "inner")

Notice the double && sign. You can put as many conditions as you'd like in.

View solution in original post

2 REPLIES 2

avatar
Super Collaborator

Definitely possible! Here is some sample code:

gsam.join(input_file, (gsam("CCKT_NO")===input_file("ckt_id")) && (gsam("SEV_LVL") === 3)), "inner")

Notice the double && sign. You can put as many conditions as you'd like in.

avatar
Expert Contributor

@Joe Widen

Thank you Sir, But I think if we do join for a larger dataset memory issues will happen. So in such case can we use if/else or look up function here .

My Aim is to match input_file DFwith gsam DF and if CCKT_NO = ckt_id and SEV_LVL = 3 then print complete row for that ckt_id.