Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to expand a single row with a start and end date into multiple rows, one for each day

avatar
Expert Contributor

I have a rows like this below.

+--------------+-------------------+-------------------+
|            id|    line_start_date|      line_end_date|
+--------------+-------------------+-------------------+
|        ID1   | 8/15/2016 00:00:00| 8/21/2016 23:55:59|
|        ID2   | 1/25/2016 00:00:00| 1/31/2016 23:55:59|
|        ID3   | 2/29/2016 00:00:00| 2/29/2016 23:55:59|
|        ID4   |  2/8/2016 00:00:00| 2/14/2016 23:55:59|
|        ID5   | 7/25/2016 00:00:00| 7/31/2016 23:55:59|
|        ID6   |  8/1/2016 00:00:00|  8/7/2016 23:55:59|
+--------------+-------------------+-------------------+







Note:date format is "MM/dd/yyyy HH:mm:ss" in UTC

I want to generate individual dates between the start date and end date for each of the Ids.

As a result I will have more rows and easily do a groupBy to do aggregation I want.

Example Output Required:

+--------------+-------------------+
|            id|    dates          |   
+--------------+-------------------+
|        ID1   | 8/15/2016 00:00:00|
|        ID1   | 8/16/2016 00:00:00|
|        ID1   | 8/17/2016 00:00:00| 
|        ...   | ......            | 
|        ...   | ......            | 
+--------------+-------------------+

How can I do this in dataframe API? Searched for hours, no clue yet!

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Something similar using RDDs

Steps

  • Read file as RDD
  • Create new RDD - for each line/entry on the file create a list of tuples (id,date), for each date between d1 and d2
  • Flatten the list to generate the final RDD with each id, date combination per row
 def main(args: Array[String]): Unit = 
{    
var sc = new SparkContext("local[*]", "app1")
varfileRdd = sc.textFile("inFile");    
var explodedRdd = fileRdd.map{x=>getRddList(x)}.flatMap(y=>y)
explodedRdd.saveAsTextFile("outDir")  
}  
def getDaysBetweenDates(startdate: Date, enddate: Date): ListBuffer[String] =    {      
var dateList = new ListBuffer[String]()      
var calendar = new GregorianCalendar()      
calendar.setTime(startdate)      
while (calendar.getTime().before(enddate)) {        
dateList += calendar.getTime().toString()        
calendar.add(Calendar.DATE, 1)     
}      
dateList += calendar.getTime().toString()
dateList    
} 
 
def getRddList(a :String) : ListBuffer[(String,String)] = {    
var allDates = new ListBuffer[(String,String)]()    
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")    
for (x <- getDaysBetweenDates(format.parse(a.split(",")(1)), format.parse(a.split(",")(2)))){  
allDates += ((a.split(",")(0).toString(),x))    
}
allDates  
}

View solution in original post

4 REPLIES 4

avatar
Super Collaborator

Something similar using RDDs

Steps

  • Read file as RDD
  • Create new RDD - for each line/entry on the file create a list of tuples (id,date), for each date between d1 and d2
  • Flatten the list to generate the final RDD with each id, date combination per row
 def main(args: Array[String]): Unit = 
{    
var sc = new SparkContext("local[*]", "app1")
varfileRdd = sc.textFile("inFile");    
var explodedRdd = fileRdd.map{x=>getRddList(x)}.flatMap(y=>y)
explodedRdd.saveAsTextFile("outDir")  
}  
def getDaysBetweenDates(startdate: Date, enddate: Date): ListBuffer[String] =    {      
var dateList = new ListBuffer[String]()      
var calendar = new GregorianCalendar()      
calendar.setTime(startdate)      
while (calendar.getTime().before(enddate)) {        
dateList += calendar.getTime().toString()        
calendar.add(Calendar.DATE, 1)     
}      
dateList += calendar.getTime().toString()
dateList    
} 
 
def getRddList(a :String) : ListBuffer[(String,String)] = {    
var allDates = new ListBuffer[(String,String)]()    
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")    
for (x <- getDaysBetweenDates(format.parse(a.split(",")(1)), format.parse(a.split(",")(2)))){  
allDates += ((a.split(",")(0).toString(),x))    
}
allDates  
}

avatar
Super Collaborator

++ you could then convert the RDD to a dataframe if required.

avatar
Expert Contributor

works perfectly now.

avatar
Expert Contributor

Thanks! I will be testing it soon and will be accepting your answer if it works out.