Support Questions
Find answers, ask questions, and share your expertise

Compare 2 dataframe aand generate single text file using scala

Compare 2 dataframe aand generate single text file using scala

Rising Star

I have 2 tables in sql DB. bothe the records have a common field location. I want to compare 2 tables .

I would like to show all the data from tab1 (where tab1.location=tab2.location AND tab2.pole= 'N')

tab1 = 25000 no of records

tab2 = 15 no of of records

Everything I need in scala . took more than 40 mins for a single comparision. can someone help me here

tab1>>
+---+---+------+--------+-----+--------+
| ID|phno.|location|address|name|temp
+---+---+---------------+------+--------+
|  1|  9xx|   IND|  Street no1| 30F
|  2|  5xx|   USA|  RHGD no 23| 23F
|  3|  4xx|   RUS|  YWKY 58| 20F
+---+---+------+---+-----+-----+-----+


tab2>>
+---+---+------+--------+
| ID|phno.|location|pole|
+---+---+---------------+
|  1|  9xx|   IND|  S|
|  2|  5xx|   USA|  N|
|  3|  4xx|   RUS|  N|
+---+---+------+---+
5 REPLIES 5

Re: Compare 2 dataframe aand generate single text file using scala

Expert Contributor

One suggestion is to use some of the native Dataframe APIs, for example you could use the intersect API as a way of performing an inner join, look at using the Filter API against the large table/dataframe, i.e. dataframe.filter($location='USA'), etc. This isn't joining the tables, but would assist in understanding the performance of Spark. 40 mins for these small volumes, sounds like something is amiss. What versions are you using here?

Would also run some primitive tests to validate if there is something fundamentally wrong with the env.

Re: Compare 2 dataframe aand generate single text file using scala

Rising Star

Spark version 1.6.2

Re: Compare 2 dataframe aand generate single text file using scala

Expert Contributor

Hi @Dinesh Das

You need to run a broadcast hash join as one of your dataset is too small compared to the other. In a Broadcast Hash Join the small dataset is broadcasted to each partition of the Large Dataset. Thich leads to a highly efficient and super-fast join.

import org.apache.spark.sql.functions.broadcast

val joined_tables = tab1.join(broadcast(tab2), "key")

This will definitely speed up your job.

Re: Compare 2 dataframe aand generate single text file using scala

Rising Star

@Adnan Alvee

am not looking for joining 2 tables. I am looking for a compariosion between two DFs and generate a new DF and save it to a single text file . in my case its all getting saved as a parquet file.

Re: Compare 2 dataframe aand generate single text file using scala

Expert Contributor

I don't know of any other way of comparing two dataframe other than joining first. Here is your action item.

1. Join tab1 and tab2 using broadcast hash by column key "locaton".

2. Filter column by pole = N

Here is my code from your sample data. Paste it in a spark shell and see.

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

case class tab1x(id:Int,phno:Long,location:String,address:String,name:String,temp:String)
case class tab2x(id:Int,phno:Long,location:String,pole:String)

val tab1 = Seq(
              tab1x(1,656,"IND","Street no1","X","30F"),
              tab1x(2,657,"USA","RHGD no 23","Y","23F"),
              tab1x(3,658,"RUS","YWKY 58","Z", "20F")
            ).toDF
val tab2 = Seq(
              tab2x(1,656,"IND","S"),
              tab2x(2,657,"USA","N"),
              tab2x(3,658,"RUS","N")
            ).toDF

val joined_df = tab1.join(broadcast(tab2), "location")

val z = joined_df.filter($"pole" === "N")         

z.show()