Read multiple tables in parallel using Spark

New Contributor

Hello,

 

I've created a Spark job that reads data from multiple tables sequentially (looping through each table), but I don't think this is an optimal way to read the data. Is there any way to leverage Spark to read from multiple tables in parallel?

 

Here is my pseudo script.

 

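from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_list = ['table1', 'table2', 'table3', 'table4']

# Tables are read one after another; keep each DataFrame under its table name
# instead of overwriting a single variable on every pass.
jdbc_dfs = {}
for table in table_list:
    jdbc_dfs[table] = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()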

 

Another challenge with the current solution is that reading data from a very large table is slow. I found a way to implement parallel reads using partitionColumn, but I'm not sure whether it only works with numeric (sequential) values.

partitionColumn

 

https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html

 

1 REPLY

Re: Read multiple tables in parallel using Spark

Cloudera Employee

Hello,

 

You can try setting 'spark.scheduler.mode' to 'FAIR':

 

conf.set("spark.scheduler.mode", "FAIR")

so that jobs submitted from separate threads share cluster resources and can run in parallel.
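
Note that the scheduler mode only controls how concurrently submitted jobs share resources; according to the job-scheduling page, jobs run simultaneously when they are submitted from different threads. A minimal sketch of that idea follows, assuming an existing SparkSession named `spark`. The helper names `read_table` and `run_in_parallel`, the `max_workers` value, and the use of `count()` as the action are illustrative assumptions, not part of the original script.

```python
from concurrent.futures import ThreadPoolExecutor


def read_table(spark, table):
    """Load one table over JDBC and trigger the read with an action."""
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql:dbserver")
        .option("dbtable", "schema.{}".format(table))
        .option("user", "username")
        .option("password", "password")
        .load()
    )
    # load() is lazy; count() is only a stand-in for whatever action
    # (write, aggregate, ...) the real job performs on the table.
    return table, df.count()


def run_in_parallel(func, items, max_workers=4):
    """Call func(item) for every item from a pool of driver-side threads.

    Results come back in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


# Usage, assuming `spark` already exists:
# table_list = ['table1', 'table2', 'table3', 'table4']
# counts = dict(run_in_parallel(lambda t: read_table(spark, t), table_list))
```

Each thread submits its own Spark job, so the tables are processed concurrently as far as executor resources and the database allow. The thread count here is a guess and would need tuning against the database's connection limits.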

 

Please refer to the documentation [1].

 

[1] https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
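
On the partitionColumn question in the original post: recent Spark documentation for the JDBC data source says partitionColumn must be a numeric, date, or timestamp column, and that lowerBound and upperBound only decide the partition stride rather than filtering rows, so the values do not have to be sequential. A rough sketch of a partitioned read is below. The column name `id`, the bounds, and the partition count are hypothetical placeholders, and `partitioned_jdbc_options` is a helper made up for this example.

```python
def partitioned_jdbc_options(table, column, lower, upper, num_partitions):
    """Build JDBC reader options that split one table into parallel reads.

    All four partitioning options must be supplied together.
    """
    return {
        "url": "jdbc:postgresql:dbserver",
        "dbtable": "schema.{}".format(table),
        "user": "username",
        "password": "password",
        # Column used to split the table into ranges (hypothetical name).
        "partitionColumn": column,
        # Bounds only set the stride; rows outside them are still read.
        "lowerBound": str(lower),
        "upperBound": str(upper),
        # Maximum number of partitions, and so of concurrent connections.
        "numPartitions": str(num_partitions),
    }


# Usage, assuming an existing SparkSession named `spark`:
# opts = partitioned_jdbc_options("table1", "id", 1, 1000000, 8)
# df = spark.read.format("jdbc").options(**opts).load()
```

If the column's values are heavily skewed, the ranges will be uneven and some partitions will carry most of the rows, so picking bounds close to the real minimum and maximum tends to help.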