
How to use multiprocessing to distribute a function's execution across multiple worker nodes

I have a function that selects data from a DataFrame based on a filter, performs some operation on it, and writes the result to disk.


I want to run this process in parallel, utilizing all worker nodes, by calling the same function but distributing the calls to different nodes. Here is what I attempted using multiprocessing with PySpark. It doesn't actually seem to run in parallel, though: performance is the same as a plain for-loop (one call after another):


from multiprocessing.dummy import Pool  # thread-based pool, despite the name
import pickle
import numpy as np
from pyspark.sql.functions import col

def func(name):
    # `df` is the source DataFrame; keep the coordinate columns for this object type
    df_obj_filter = df.select("Cord1", "Cord2", "Cord3").where(col("obj_type") == name)
    np_obj_filter = np.array( l: (l[0], l[1], l[2])).collect())
    # ... build `tree` from np_obj_filter ...
    with open('/home/tree_' + str(name), 'wb') as f:
        pickle.dump(tree, f)

if __name__ == '__main__':
    pool = Pool(processes=4)
    result =, obj_list_cur)


Has anybody tried something similar? How can I run the same function across different nodes in parallel, with different parameters on each node?