
How to use multiprocessing to distribute a function's execution across multiple worker nodes

I have a function which selects data from a DataFrame based on a filter, does some operation on it, and writes the result to hard disk.

 

I want to run this process in parallel on all the worker nodes by calling the same function with different parameters, distributed to different nodes. Here is what I attempted using multiprocessing in PySpark, but it doesn't really seem to run in parallel: performance is the same as a plain for-loop (one call after another):

 

import pickle

import numpy as np
from multiprocessing.dummy import Pool   # thread pool (despite the name, no extra processes)
from pyspark.sql.functions import col
from scipy import spatial

def func(name):
    # Filter the coordinates for one object type and pull them back to the driver
    obj_filter = obj_mod.select("Cord1", "Cord2", "Cord3").where(col("obj_type") == name)
    np_obj_filter = np.array(obj_filter.rdd.map(lambda l: (l[0], l[1], l[2])).collect())
    # Build a KD-tree from the collected points and pickle it to local disk
    tree = spatial.cKDTree(np_obj_filter, leafsize=16)
    with open('/home/tree_' + str(name), 'wb') as f:
        pickle.dump(tree, f)

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(func, obj_list_cur)

 

Has anybody tried something similar? How can I run the same function across different nodes in parallel, with a different parameter on each node?
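
For reference, here is a rough sketch of one direction I have been considering (not a verified solution): push the grouping and tree building onto the executors instead of the driver. It assumes obj_mod also has the obj_type column alongside the coordinates, that scipy is installed on the workers, and that writing to each executor's local /home is acceptable; build_and_save is just a placeholder name.

import pickle

import numpy as np
from scipy import spatial

def build_and_save(name_and_points):
    # Runs on an executor: build the KD-tree for one obj_type and pickle it
    # to that executor's local disk (the file ends up on the worker node).
    name, points = name_and_points
    tree = spatial.cKDTree(np.array(list(points)), leafsize=16)
    with open('/home/tree_' + str(name), 'wb') as f:
        pickle.dump(tree, f)

(obj_mod
    .select("obj_type", "Cord1", "Cord2", "Cord3")
    .rdd
    .map(lambda r: (r[0], (r[1], r[2], r[3])))
    .groupByKey()              # one group of points per obj_type, spread across executors
    .foreach(build_and_save))  # each tree is built in parallel on a worker node

That way the heavy work would happen on the workers rather than the driver, though I'm not sure the output landing on each worker's local disk is what I want.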
