Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Can I join 2 dataframe with condition in column value?

avatar
Expert Contributor

I have 2 Dataframe and I would like to show the one of the dataframe if my conditions satishfied. I want to match the first column of both the DB and also the condition SEV_LVL='3'. Can I get some guidance or help please

scala> input_file.show()
+-----------+--------+-----+----+-------+
|     ckt_id|location|usage|port|machine|
+-----------+--------+-----+----+-------+
|     ckt_id|location|usage|port|machine|
| AXZCSD21DF|     USA|  2GB| 101|   MAC1|
| ABZCSD21DF|     OTH|  4GB| 101|   MAC2|
| AXZCSD21DF|     USA|  6GB| 101|   MAC4|
| BXZCSD21DF|     USA|  7GB| 101|   MAC6|
| CXZCSD21DF|     IND|  2GB| 101|   MAC9|
| AXZCSD21DF|     USA|  1GB| 101|   MAC0|
| AXZCSD22DF|     IND|  9GB| 101|   MAC3|
|ADZZCSD21DF|     USA|  1GB| 101|   MAC4|
| AXZCSD21DF|     USA|  2GB| 101|   MAC5|
| XZDCSD21DF|     OTH|  2GB| 101|   MAC1|
+-----------+--------+-----+----+-------+

scala> gsam.show()
+-----------+-------+
|    CCKT_NO|SEV_LVL|
+-----------+-------+
| AXZCSD21DF|      1|
| BXZCSD21DF|      1|
| ABZCSD21DF|      3|
| CXZCSD21DF|      2|
| AXZCSD22DF|      2|
| XZDCSD21DF|      3|
|ADZZCSD21DF|      1|
+-----------+-------+





scala> val gsamjoin = gsam.join(input_file,(gsam("CCKT_NO") <=> input_file("ckt_id")));
gsamjoin: org.apache.spark.sql.DataFrame = [CCKT_NO: string, SEV_LVL: decimal(38,0), ckt_id: string, location: string, usage: string, port: string, machine: string]


scala> gsamjoin.show()
+-----------+-------+-----------+--------+-----+----+-------+
|    CCKT_NO|SEV_LVL|     ckt_id|location|usage|port|machine|
+-----------+-------+-----------+--------+-----+----+-------+
| CXZCSD21DF|      2| CXZCSD21DF|     IND|  2GB| 101|   MAC9|
| ABZCSD21DF|      3| ABZCSD21DF|     OTH|  4GB| 101|   MAC2|
| XZDCSD21DF|      3| XZDCSD21DF|     OTH|  2GB| 101|   MAC1|
| AXZCSD22DF|      2| AXZCSD22DF|     IND|  9GB| 101|   MAC3|
|ADZZCSD21DF|      1|ADZZCSD21DF|     USA|  1GB| 101|   MAC4|
| BXZCSD21DF|      1| BXZCSD21DF|     USA|  7GB| 101|   MAC6|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  2GB| 101|   MAC1|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  6GB| 101|   MAC4|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  1GB| 101|   MAC0|
| AXZCSD21DF|      1| AXZCSD21DF|     USA|  2GB| 101|   MAC5|
+-----------+-------+-----------+--------+-----+----+-------+
1 ACCEPTED SOLUTION

avatar
Super Collaborator

Definitely possible! Here is some sample code:

gsam.join(input_file, (gsam("CCKT_NO")===input_file("ckt_id")) && (gsam("SEV_LVL") === 3)), "inner")

Notice the double && sign. You can put as many conditions as you'd like in.

View solution in original post

2 REPLIES 2

avatar
Super Collaborator

Definitely possible! Here is some sample code:

gsam.join(input_file, (gsam("CCKT_NO")===input_file("ckt_id")) && (gsam("SEV_LVL") === 3)), "inner")

Notice the double && sign. You can put as many conditions as you'd like in.

avatar
Expert Contributor

@Joe Widen

Thank you Sir, But I think if we do join for a larger dataset memory issues will happen. So in such case can we use if/else or look up function here .

My Aim is to match input_file DFwith gsam DF and if CCKT_NO = ckt_id and SEV_LVL = 3 then print complete row for that ckt_id.