I have a function that selects data from a DataFrame based on a filter, does some operation on it, and writes the result to disk.
I want to run this process in parallel across all worker nodes by calling the same function with different parameters on different nodes. Here is what I attempted using multiprocessing in PySpark, but it does not actually seem to run in parallel: performance is the same as a plain for-loop running the calls one after another:
import numpy as np
from multiprocessing.dummy import Pool  # thread pool on the driver

def func(name):
    # filter, collect the rows to the driver as a NumPy array, write to disk
    np_obj_filter = np.array(obj_filter.rdd.map(lambda l: (l, l, l)).collect())
    with open('/home/tree_' + str(name), 'wb') as f:
        np.save(f, np_obj_filter)  # stand-in for the actual operation and write

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(func, obj_list_cur)
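
For context, here is a minimal, self-contained sketch of the same pattern; the DataFrame, the filter values, and the pickled output are hypothetical stand-ins for my real data, but the structure (a driver-side thread pool mapping one parameter per task) is exactly what I am doing:

import pickle
from multiprocessing.dummy import Pool  # threads on the driver, not separate processes

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
# hypothetical source DataFrame and parameter list standing in for my real ones
df = spark.range(0, 1000000).withColumnRenamed("id", "value")
obj_list_cur = [0, 1, 2, 3]

def func(name):
    # each call filters the shared DataFrame, collects the matching rows to
    # the driver, and writes them to local disk
    rows = df.filter(col("value") % 4 == name).collect()
    with open('/home/tree_' + str(name), 'wb') as f:
        pickle.dump([r.asDict() for r in rows], f)

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(func, obj_list_cur)  # each thread submits its own Spark job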
Has anybody tried something similar? How can I run the same function in parallel across different nodes, passing different parameters to each node?