Created 04-15-2020 11:30 AM
I have a source table like:
ID | USER  | DEPT
1  | User1 | Admin
2  | User1 | Accounts
3  | User2 | Finance
4  | User3 | Sales
5  | User3 | Finance
I want to generate a DataFrame like this:
ID | USER  | DEPARTMENT
1  | User1 | Admin,Accounts
2  | User2 | Finance
3  | User3 | Sales,Finance
Created 04-15-2020 12:08 PM
Hi Chinedu,
This should help: https://stackoverflow.com/questions/48406304/groupby-and-concat-array-columns-pyspark
Thanks,
Paul
Created 04-15-2020 12:27 PM
Thanks @pauldefusco
I would like to do it in Spark Scala.
Created on 04-15-2020 05:01 PM - edited 04-15-2020 05:06 PM
Hi @ChineduLB ,
You can use `.groupBy` with `concat_ws(",", collect_list(...))` to combine the departments per user, and the `row_number` window function to regenerate the `ID` column.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // concat_ws, collect_list, row_number

val df = Seq(("1","User1","Admin"),("2","User1","Accounts"),("3","User2","Finance"),("4","User3","Sales"),("5","User3","Finance")).toDF("ID","USER","DEPT")
val w = Window.orderBy("USER") // window used by row_number to assign the new sequential IDs
df.groupBy("USER").agg(concat_ws(",", collect_list("DEPT")).alias("DEPARTMENT")).withColumn("ID", row_number().over(w)).select("ID","USER","DEPARTMENT").show()
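For reference, running the snippet above in spark-shell on this small sample should print something like:
+---+-----+--------------+
| ID| USER|    DEPARTMENT|
+---+-----+--------------+
|  1|User1|Admin,Accounts|
|  2|User2|       Finance|
|  3|User3| Sales,Finance|
+---+-----+--------------+
Two caveats: `collect_list` does not guarantee ordering within a group, so the department order can vary once the data spans multiple partitions, and `Window.orderBy` without a `partitionBy` moves all rows to a single partition, which is fine here but Spark will warn about it on large datasets.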