Posted 07-08-2017 03:06 AM
Okay guys, I have a situation where my data frame has the following schema:

Customer_Id  Role_Code  Start_Timestamp  End_Timestamp
Ray123       1          2015             2017
Kate123      --         2016             2017

I wish to decide the Role_Code of a given customer (say "Ray123") based on a few conditions. Let's say his Role_Code comes out to be 1. I then process the next row, and the next customer (say "Kate123") has a time range overlapping Ray123's. She can challenge Ray123 and might win Role_Code 1 from him (based on some other conditions). If she wins, then for the overlapping period I need to set Ray123's Role_Code to 2, so the data looks like:

Customer_Id  Role_Code  Start_Timestamp  End_Timestamp
Ray123       1          2015             2016
Ray123       2          2016             2017
Kate123      1          2016             2017

Similar things happen elsewhere: I need to go back and forth, pick rows, compare the timestamps and some other fields, then take unions, do except, etc. to arrive at a final data frame with the correct set of customers and role codes.

The problem is that the solution works fine for 5-6 rows, but when I test against, e.g., 70 rows, the YARN container kills the job; it always runs out of memory. I don't know how to solve this without multiple actions such as head(), first(), etc. getting in the way of processing each row and then splitting the rows. Any input would help at this point!
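For what it's worth, the row-splitting step itself can be expressed as a pure function over records rather than a sequence of driver-side actions. Below is a minimal plain-Python sketch of that step, assuming integer years and a simple "the challenger wins and takes Role_Code 1, the incumbent drops to 2 for the contested window" rule; the function name and the tuple layout are my own invention, and the actual winning conditions and the Spark translation (e.g. a self-join on overlapping ranges plus a union of the split pieces) are not shown here:

```python
def split_on_challenge(incumbent, challenger):
    """Split the incumbent's interval where the challenger's overlaps it.

    Each record is a tuple (customer_id, role_code, start, end).
    Returns the replacement rows for the incumbent, followed by the
    challenger holding role 1 for her whole interval.
    """
    cid, role, start, end = incumbent
    ch_id, _, ch_start, ch_end = challenger
    overlap_start = max(start, ch_start)
    overlap_end = min(end, ch_end)
    if overlap_start >= overlap_end:
        # No overlap: both rows survive unchanged.
        return [incumbent, challenger]
    rows = []
    if start < overlap_start:
        # Uncontested lead-in: incumbent keeps the original role code.
        rows.append((cid, role, start, overlap_start))
    # Contested window: incumbent is demoted to role 2 (assumed rule).
    rows.append((cid, 2, overlap_start, overlap_end))
    if overlap_end < end:
        # Uncontested tail: incumbent keeps the original role code.
        rows.append((cid, role, overlap_end, end))
    # Challenger takes role 1 for her full interval.
    rows.append((ch_id, 1, ch_start, ch_end))
    return rows

result = split_on_challenge(("Ray123", 1, 2015, 2017),
                            ("Kate123", 1, 2016, 2017))
# result reproduces the three rows from the second table above
```

Phrasing the logic this way matters for the memory problem: calling head()/first() per row pulls data to the driver one action at a time, whereas a set-based formulation (join the frame to itself on overlap, derive the split rows in one pass, then union) stays inside a single distributed plan.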
Labels:
- Apache Spark