07-01-2019 02:19 PM
@DADA206 If you want to count all duplicated rows, you can use one of these methods.

1. Using the dropDuplicates function:

```
scala> val df1 = Seq((1,"q"),(2,"c"),(3,"d"),(1,"q"),(2,"c"),(3,"e")).toDF("id","n")
scala> println("duplicated counts:" + (df1.count - df1.dropDuplicates.count))
duplicated counts:2
```

There are 2 extra duplicated rows in the dataframe, which means that in total 4 rows are involved in duplication.
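As a side note, dropDuplicates also accepts column names, so the same subtraction can count duplicates on a subset of columns. This is a minimal sketch, not from the original answer, assuming the same df1 as above:

```
scala> // counts rows that share an "id" with an earlier row; with df1 this is 6 - 3 = 3
scala> df1.count - df1.dropDuplicates("id").count
```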
2. Using groupBy on all columns:

```
scala> import org.apache.spark.sql.functions._
scala> val cols = df1.columns
scala> df1.groupBy(cols.head, cols.tail:_*).agg(count("*").alias("cnt")).filter('cnt > 1).select(sum("cnt")).show()
+--------+
|sum(cnt)|
+--------+
|       4|
+--------+
```
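If you also want to see which rows are duplicated, not just the total, drop the final aggregation and show the groups directly. A small sketch assuming the same df1 and cols as above (row order may vary):

```
scala> df1.groupBy(cols.head, cols.tail:_*).agg(count("*").alias("cnt")).filter('cnt > 1).show()
+---+---+---+
| id|  n|cnt|
+---+---+---+
|  1|  q|  2|
|  2|  c|  2|
+---+---+---+
```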
3. Using window functions:

```
scala> import org.apache.spark.sql.expressions.Window
scala> val wdw = Window.partitionBy(cols.head, cols.tail:_*)
wdw: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1f6df7ac
scala> df1.withColumn("cnt", count("*").over(wdw)).filter('cnt > 1).count()
res80: Long = 4
```
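All of the snippets above assume spark-shell, where the SparkSession and its implicits are already in scope. For reference, here is a minimal self-contained sketch of the three methods as a standalone application; the object name DuplicateCounts and the local[*] master are assumptions for illustration, not part of the original answer:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object DuplicateCounts {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; adjust master/appName for your cluster.
    val spark = SparkSession.builder().appName("DuplicateCounts").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1,"q"),(2,"c"),(3,"d"),(1,"q"),(2,"c"),(3,"e")).toDF("id","n")
    val cols = df1.columns

    // 1. dropDuplicates: extra copies beyond the first of each distinct row (2 here)
    println("extra copies: " + (df1.count - df1.dropDuplicates().count))

    // 2. groupBy on all columns: total rows belonging to a duplicated group (4 here)
    df1.groupBy(cols.head, cols.tail:_*)
       .agg(count("*").alias("cnt"))
       .filter($"cnt" > 1)
       .select(sum("cnt"))
       .show()

    // 3. window function: same total, computed per row (4 here)
    val wdw = Window.partitionBy(cols.head, cols.tail:_*)
    println("rows in duplicated groups: " +
      df1.withColumn("cnt", count("*").over(wdw)).filter($"cnt" > 1).count())

    spark.stop()
  }
}
```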
04-23-2019 03:18 AM
Hi all, I resolved this problem by adding the missing jar files to "/usr/hdp/3.1.0.0-78/storm/lib/". Thanks.