07-01-2019 02:19 PM
@DADA206 If you want to count all duplicated rows, you can use one of these methods.

1. Using the dropDuplicates function:

```
scala> val df1 = Seq((1,"q"),(2,"c"),(3,"d"),(1,"q"),(2,"c"),(3,"e")).toDF("id","n")
scala> println("duplicated counts:" + (df1.count - df1.dropDuplicates.count))
duplicated counts:2
```

There are 2 extra duplicated rows in the dataframe, which means that in total 4 rows are involved in duplication.
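As a side note, dropDuplicates also accepts column names, so the same subtraction can count duplicates on a subset of columns. This is a minimal sketch, not from the original answer, assuming the same df1 as above:

```
scala> // counts rows that share an "id" with an earlier row; with df1 this is 6 - 3 = 3
scala> df1.count - df1.dropDuplicates("id").count
```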
2. Using groupBy on all columns:

```
scala> import org.apache.spark.sql.functions._
scala> val cols = df1.columns
scala> df1.groupBy(cols.head, cols.tail:_*).agg(count("*").alias("cnt")).filter('cnt > 1).select(sum("cnt")).show()
+--------+
|sum(cnt)|
+--------+
|       4|
+--------+
```
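If you also want to see which rows are duplicated, not just the total, drop the final aggregation and show the groups directly. A small sketch assuming the same df1 and cols as above (row order may vary):

```
scala> df1.groupBy(cols.head, cols.tail:_*).agg(count("*").alias("cnt")).filter('cnt > 1).show()
+---+---+---+
| id|  n|cnt|
+---+---+---+
|  1|  q|  2|
|  2|  c|  2|
+---+---+---+
```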
3. Using window functions:

```
scala> import org.apache.spark.sql.expressions.Window
scala> val wdw = Window.partitionBy(cols.head, cols.tail:_*)
wdw: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1f6df7ac
scala> df1.withColumn("cnt", count("*").over(wdw)).filter('cnt > 1).count()
res80: Long = 4
```
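All of the snippets above assume spark-shell, where the SparkSession and its implicits are already in scope. For reference, here is a minimal self-contained sketch of the three methods as a standalone application; the object name DuplicateCounts and the local[*] master are assumptions for illustration, not part of the original answer:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object DuplicateCounts {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; adjust master/appName for your cluster.
    val spark = SparkSession.builder().appName("DuplicateCounts").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1,"q"),(2,"c"),(3,"d"),(1,"q"),(2,"c"),(3,"e")).toDF("id","n")
    val cols = df1.columns

    // 1. dropDuplicates: extra copies beyond the first of each distinct row (2 here)
    println("extra copies: " + (df1.count - df1.dropDuplicates().count))

    // 2. groupBy on all columns: total rows belonging to a duplicated group (4 here)
    df1.groupBy(cols.head, cols.tail:_*)
       .agg(count("*").alias("cnt"))
       .filter($"cnt" > 1)
       .select(sum("cnt"))
       .show()

    // 3. window function: same total, computed per row (4 here)
    val wdw = Window.partitionBy(cols.head, cols.tail:_*)
    println("rows in duplicated groups: " +
      df1.withColumn("cnt", count("*").over(wdw)).filter($"cnt" > 1).count())

    spark.stop()
  }
}
```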
04-23-2019 03:18 AM
Hi all, I resolved this problem by adding the missing jar files to "/usr/hdp/3.1.0.0-78/storm/lib/". Thanks.