Hello everyone! I need help with an issue: updating a PySpark DataFrame based on extracting one or more records of a group. Below are screenshots of my sample data (image 1) and the required output (image 2), followed by the scenario.

Scenario: A group of customers (1000) needs certain products (250) from several stores (200 in number). Each customer has a demand for certain products, and every store has some inventory of some or all products. Products have to be allocated to individual customers based on the savings incurred to the stores.

For example, customers A and B need product 1 from store 1, and their demands are 10 and 15. Store 1's inventory for product 1 is 12. B can also take product 1 from store 2, which has an inventory of 10. A can also take product 1 from store 3, which has an inventory of 20.

Output: Based on the logic that I wrote, this is the sample output.
Store 1 allots 10 units to A and 2 units to B, based on the profit incurred.
Store 2 allots 10 units to B; B's complete demand cannot be met because none of the stores B can use has any inventory left.
Store 3 cannot allot anything to A because A's demand was already met from store 1.

My data size is 6 GB. I developed a Python script that uses a "for" loop over each and every row to address this, but it can't be run on Spark because it is not a parallel processing job. Kindly help me address this issue in PySpark, Spark SQL, or HiveQL.

Thanks in advance for your help!
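To make the example above concrete, here is a minimal sketch of the allocation for a single product in plain Python. It assumes a greedy rule (serve the customer/store pair with the highest saving first), which is only a guess at the logic described above. The saving values, the store labels S1 to S3, and the function name `allocate_product` are all hypothetical and exist only for illustration.

```python
def allocate_product(demand, inventory, candidates):
    """Greedy allocation for ONE product (assumed rule, not confirmed by the post).

    demand:     {customer: units wanted}
    inventory:  {store: units in stock}
    candidates: list of (customer, store, saving) pairs that are allowed
    """
    remaining_demand = dict(demand)      # copy so the inputs stay untouched
    remaining_stock = dict(inventory)
    allocations = []

    # Visit the allowed pairs from highest saving to lowest.
    for customer, store, _saving in sorted(candidates, key=lambda c: -c[2]):
        qty = min(remaining_demand[customer], remaining_stock[store])
        if qty > 0:
            allocations.append((store, customer, qty))
            remaining_demand[customer] -= qty
            remaining_stock[store] -= qty

    return allocations, remaining_demand


# Product 1 from the example; the saving values are made up.
demand = {"A": 10, "B": 15}
inventory = {"S1": 12, "S2": 10, "S3": 20}
candidates = [
    ("A", "S1", 5.0),
    ("B", "S1", 4.0),
    ("B", "S2", 3.0),
    ("A", "S3", 2.0),
]

allocations, unmet = allocate_product(demand, inventory, candidates)
print(allocations)  # [('S1', 'A', 10), ('S1', 'B', 2), ('S2', 'B', 10)]
print(unmet)        # {'A': 0, 'B': 3}
```

If the allocation of one product never affects another (an assumption; the scenario lists demand and inventory per product), the sequential part only needs to run inside each product's rows. One possible way to do that on Spark 3.0 or later is to call a function like this once per product through `groupBy("product").applyInPandas(...)`, where the column name `product` is a guess at the schema. The 250 products can then be processed in parallel even though each group is still handled in order.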