Support Questions
Find answers, ask questions, and share your expertise

Multiple withColumn with lag

Explorer

Hi.

To generate Sankey chart I run this code.

* code

import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

var customers = sc.parallelize(List(("Alice", "click","item_8", 50),
                                    ("Alice", "view","item_2", 55),
                                    ("Alice", "share","item_11", 100),
                                    ("Bob", "view","item_11", 25),
                                    ("Bob", "share","ietm_2", 50),
                                    ("Bob", "view", "item_8",65))).toDF("name", "event", "item", "time")
                                    
val wSpec3 = Window.partitionBy("name").orderBy("time")
customers.withColumn(
    "prev_event", lag(col("event"),1).over(wSpec3)
).withColumn(
    "prev_item", lag(col("item"),1).over(wSpec3)
).withColumn(
    "prev_time", lag(col("time"),1).over(wSpec3)
).na.fill(Map(
     "prev_time" -> -10000
)).withColumn(
    "time_gap", col("time") - col("prev_time")
).select("name","prev_event","prev_time","prev_item","event","time","item","time_gap")show()

* output

+-----+----------+---------+---------+-----+----+-------+--------+
| name|prev_event|prev_time|prev_item|event|time|   item|time_gap|
+-----+----------+---------+---------+-----+----+-------+--------+
|Alice|      null|   -10000|     null|click|  50| item_8|   10050|
|Alice|     click|       50|   item_8| view|  55| item_2|       5|
|Alice|      view|       55|   item_2|share| 100|item_11|      45|
|  Bob|      null|   -10000|     null| view|  25|item_11|   10025|
|  Bob|      view|       25|  item_11|share|  50| ietm_2|      25|
|  Bob|     share|       50|   ietm_2| view|  65| item_8|      15|
+-----+----------+---------+---------+-----+----+-------+--------+

That output is exactly what I expected,

But what about verbose multiple withColumn(... lag(...)...) codes?

Is that code can be improved?