06-22-2018 01:56 AM
I have a function which selects data from a DataFrame based on a filter, does some operation on it, and writes the result to disk.
I want to run this process in parallel across all worker nodes by calling the same function with a different parameter on each node. Here is what I attempted using multiprocessing in PySpark, but it doesn't really seem to run in parallel: performance is the same as a plain for-loop (one call after another):
import pickle
import numpy as np
from scipy import spatial
from multiprocessing.dummy import Pool
from pyspark.sql.functions import col

def func(name):
    # keep only the rows of this obj_type and pull out the three coordinate columns
    obj_filter = obj_mod.where(col("obj_type") == name).select("Cord1", "Cord2", "Cord3")
    np_obj_filter = np.array(obj_filter.rdd.map(lambda l: (l[0], l[1], l[2])).collect())
    # build a KD-tree from the collected coordinates and pickle it to disk
    tree = spatial.cKDTree(np_obj_filter, leafsize=16)
    with open('/home/tree_' + str(name), 'wb') as f:
        pickle.dump(tree, f)

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(func, obj_list_cur)
Has anybody tried something similar? How can I run the same function in parallel across different nodes, with different parameters on each node?
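In case it clarifies what I'm after, here is a rough sketch of the kind of distribution I have in mind, using a grouped pandas UDF so Spark itself spreads the per-obj_type work across the executors. I haven't verified this; it assumes Spark 2.3+ with Arrow enabled, the same column names as above, and that writing each pickled tree to a worker-local path is acceptable:

import pickle
import pandas as pd
from scipy import spatial
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("obj_type string, n_points long", PandasUDFType.GROUPED_MAP)
def build_tree(pdf):
    # pdf holds all rows of one obj_type, delivered to a single executor
    name = pdf["obj_type"].iloc[0]
    coords = pdf[["Cord1", "Cord2", "Cord3"]].values
    tree = spatial.cKDTree(coords, leafsize=16)
    # note: this writes to the local disk of whichever worker handled the group
    with open('/home/tree_' + str(name), 'wb') as f:
        pickle.dump(tree, f)
    return pd.DataFrame({"obj_type": [name], "n_points": [len(coords)]})

summary = (obj_mod
           .select("obj_type", "Cord1", "Cord2", "Cord3")
           .groupBy("obj_type")
           .apply(build_tree)
           .collect())

I'm not sure this is the right direction either, so any pointers are appreciated.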