Support Questions
Find answers, ask questions, and share your expertise

Read multiple tables in parallel using Spark

New Contributor

Hello,

 

I've created a Spark job that reads data from multiple tables sequentially (looping through each table), but I believe this is not an optimized way of reading the data. Is there any way to leverage the power of Spark to read from multiple tables in parallel?

 

Here is my pseudocode.

 

table_list = ['table1', 'table2', 'table3', 'table4']

dfs = {}
for table in table_list:
    # Reads run one after another; each iteration blocks
    # until the previous table has been loaded.
    dfs[table] = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()

 

Another challenge with the current solution is that reading data from a gigantic table is slow. I found a way to implement a parallel read using the partitionColumn option, but I'm not sure whether it only works with numeric (sequential) values.

https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html
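
For illustration, this is the kind of partitioned read I have in mind; the column name 'id', the table name, and the bound values below are placeholders:

# Hypothetical partitioned read: 'id', 'schema.big_table', and the bounds
# are placeholders. Spark opens numPartitions parallel JDBC connections,
# each scanning one slice of [lowerBound, upperBound] on partitionColumn.
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.big_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "10") \
    .load()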

 


Re: Read multiple tables in parallel using Spark

Cloudera Employee

Hello,

 

You can try setting 'spark.scheduler.mode' to 'FAIR':

conf.set("spark.scheduler.mode", "FAIR")

so that multiple jobs will be executed in parallel.
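
As a minimal sketch of that approach (connection details, table names, and the pool size are placeholders): enable the FAIR scheduler on the SparkConf, then use a driver-side thread pool to submit one read per table, so the resulting jobs can be scheduled concurrently.

from concurrent.futures import ThreadPoolExecutor

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Enable the FAIR scheduler before the session is created.
conf = SparkConf().set("spark.scheduler.mode", "FAIR")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

def read_table(table):
    # Each call runs in its own driver thread, so the jobs it
    # triggers can be interleaved by the FAIR scheduler.
    return spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()

table_list = ['table1', 'table2', 'table3', 'table4']
with ThreadPoolExecutor(max_workers=4) as pool:
    dfs = dict(zip(table_list, pool.map(read_table, table_list)))

Note that load() itself only fetches each table's schema; the heavy data movement happens once each DataFrame is used in an action, and those are the jobs the scheduler runs in parallel.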

 

Please refer to the documentation [1].

 

[1] https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
