Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

How to expand a single row with a start and end date into multiple rows, one for each day

avatar
Expert Contributor

I have a rows like this below.

+--------------+-------------------+-------------------+
|            id|    line_start_date|      line_end_date|
+--------------+-------------------+-------------------+
|        ID1   | 8/15/2016 00:00:00| 8/21/2016 23:55:59|
|        ID2   | 1/25/2016 00:00:00| 1/31/2016 23:55:59|
|        ID3   | 2/29/2016 00:00:00| 2/29/2016 23:55:59|
|        ID4   |  2/8/2016 00:00:00| 2/14/2016 23:55:59|
|        ID5   | 7/25/2016 00:00:00| 7/31/2016 23:55:59|
|        ID6   |  8/1/2016 00:00:00|  8/7/2016 23:55:59|
+--------------+-------------------+-------------------+







Note:date format is "MM/dd/yyyy HH:mm:ss" in UTC

I want to generate individual dates between the start date and end date for each of the Ids.

As a result I will have more rows and easily do a groupBy to do aggregation I want.

Example Output Required:

+--------------+-------------------+
|            id|    dates          |   
+--------------+-------------------+
|        ID1   | 8/15/2016 00:00:00|
|        ID1   | 8/16/2016 00:00:00|
|        ID1   | 8/17/2016 00:00:00| 
|        ...   | ......            | 
|        ...   | ......            | 
+--------------+-------------------+

How can I do this in dataframe API? Searched for hours, no clue yet!

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Something similar using RDDs

Steps

  • Read file as RDD
  • Create new RDD - for each line/entry on the file create a list of tuples (id,date), for each date between d1 and d2
  • Flatten the list to generate the final RDD with each id, date combination per row
 def main(args: Array[String]): Unit = 
{    
var sc = new SparkContext("local[*]", "app1")
varfileRdd = sc.textFile("inFile");    
var explodedRdd = fileRdd.map{x=>getRddList(x)}.flatMap(y=>y)
explodedRdd.saveAsTextFile("outDir")  
}  
def getDaysBetweenDates(startdate: Date, enddate: Date): ListBuffer[String] =    {      
var dateList = new ListBuffer[String]()      
var calendar = new GregorianCalendar()      
calendar.setTime(startdate)      
while (calendar.getTime().before(enddate)) {        
dateList += calendar.getTime().toString()        
calendar.add(Calendar.DATE, 1)     
}      
dateList += calendar.getTime().toString()
dateList    
} 
 
def getRddList(a :String) : ListBuffer[(String,String)] = {    
var allDates = new ListBuffer[(String,String)]()    
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")    
for (x <- getDaysBetweenDates(format.parse(a.split(",")(1)), format.parse(a.split(",")(2)))){  
allDates += ((a.split(",")(0).toString(),x))    
}
allDates  
}

View solution in original post

4 REPLIES 4

avatar
Super Collaborator

Something similar using RDDs

Steps

  • Read file as RDD
  • Create new RDD - for each line/entry on the file create a list of tuples (id,date), for each date between d1 and d2
  • Flatten the list to generate the final RDD with each id, date combination per row
 def main(args: Array[String]): Unit = 
{    
var sc = new SparkContext("local[*]", "app1")
varfileRdd = sc.textFile("inFile");    
var explodedRdd = fileRdd.map{x=>getRddList(x)}.flatMap(y=>y)
explodedRdd.saveAsTextFile("outDir")  
}  
def getDaysBetweenDates(startdate: Date, enddate: Date): ListBuffer[String] =    {      
var dateList = new ListBuffer[String]()      
var calendar = new GregorianCalendar()      
calendar.setTime(startdate)      
while (calendar.getTime().before(enddate)) {        
dateList += calendar.getTime().toString()        
calendar.add(Calendar.DATE, 1)     
}      
dateList += calendar.getTime().toString()
dateList    
} 
 
def getRddList(a :String) : ListBuffer[(String,String)] = {    
var allDates = new ListBuffer[(String,String)]()    
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")    
for (x <- getDaysBetweenDates(format.parse(a.split(",")(1)), format.parse(a.split(",")(2)))){  
allDates += ((a.split(",")(0).toString(),x))    
}
allDates  
}

avatar
Super Collaborator

++ you could then convert the RDD to a dataframe if required.

avatar
Expert Contributor

works perfectly now.

avatar
Expert Contributor

Thanks! I will be testing it soon and will be accepting your answer if it works out.