Support Questions

Read multiple tables in parallel using Spark

New Contributor



I've created a Spark job that reads data from multiple tables sequentially (looping through each table), but I believe this is not an optimized way of reading the data. Is there any way to leverage the power of Spark to read from multiple tables in parallel?


Here is my pseudo-script.


table_list = ['table1', 'table2', 'table3', 'table4']

# Sequential reads: each iteration blocks until the previous load finishes.
for table in table_list:
    jdbcDF = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()
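One way to restructure the sequential loop, sketched below, is to submit each read from its own driver-side thread: Spark's job submission is thread-safe, so the scans can run concurrently on the cluster. This assumes an existing `SparkSession` named `spark`; the URL, schema, and credentials are the placeholders from the question.

```python
from concurrent.futures import ThreadPoolExecutor

table_list = ['table1', 'table2', 'table3', 'table4']

def read_table(table):
    # Each call defines one JDBC scan; Spark runs it as a separate job.
    return (spark.read
            .format("jdbc")
            .option("url", "jdbc:postgresql:dbserver")
            .option("dbtable", "schema.{}".format(table))
            .option("user", "username")
            .option("password", "password")
            .load())

def read_all(tables, reader, max_workers=4):
    # Submit one read per thread; pool.map preserves the input order,
    # so we can zip the results back to their table names.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(tables, pool.map(reader, tables)))

# dataframes = read_all(table_list, read_table)
```

Note that DataFrame reads are lazy, so the parallelism pays off when the threads also trigger an action (e.g. a write or count) per table.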


Another challenge with the current solution is that reading from a gigantic table is slow. I found a way to implement a parallel read using partitionColumn, but I'm not sure whether it only works with numeric (sequential) values.
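For reference, `partitionColumn` works with numeric, date, and timestamp columns (it does not need to be sequential, only roughly evenly distributed between the bounds). A sketch of both approaches, with illustrative connection details, column names, and bounds, and assuming an existing `SparkSession` named `spark` for the commented-out read calls:

```python
def range_predicates(column, lower, upper, num_partitions):
    """Build non-overlapping WHERE clauses covering [lower, upper)."""
    step = (upper - lower) // num_partitions
    bounds = [lower + i * step for i in range(num_partitions)] + [upper]
    return ["{0} >= {1} AND {0} < {2}".format(column, bounds[i], bounds[i + 1])
            for i in range(num_partitions)]

# Option 1: built-in partitioned read on a numeric/date/timestamp column.
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:postgresql:dbserver")
#       .option("dbtable", "schema.big_table")
#       .option("user", "username")
#       .option("password", "password")
#       .option("partitionColumn", "id")
#       .option("lowerBound", "1")
#       .option("upperBound", "1000000")
#       .option("numPartitions", "10")
#       .load())

# Option 2: pass explicit predicates, one per partition, for full control
# over how the table is split.
# preds = range_predicates("id", 1, 1000000, 10)
# df = spark.read.jdbc("jdbc:postgresql:dbserver", "schema.big_table",
#                      predicates=preds,
#                      properties={"user": "username", "password": "password"})
```

With either option, each partition issues its own query against the database, so the read is only as parallel as the database can serve.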




Cloudera Employee



You can try setting 'spark.scheduler.mode' to 'FAIR':


conf.set("spark.scheduler.mode", "FAIR")

so that multiple jobs will be executed in parallel.
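A minimal sketch of how this might be wired up, assuming the property is applied before the SparkSession is created (it is read at startup and cannot be changed afterwards):

```python
from pyspark.sql import SparkSession

# Enable FAIR scheduling so jobs submitted from different threads share
# cluster resources instead of queueing strictly FIFO.
spark = (SparkSession.builder
         .appName("parallel-table-reads")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())
```

Note that FAIR mode only changes how concurrent jobs share resources; the jobs still have to be submitted concurrently (e.g. from multiple driver threads) to run in parallel.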


Please refer to document [1].