Support Questions
Find answers, ask questions, and share your expertise

Read multiple tables in parallel using Spark

New Contributor

Hello,

 

I've created a Spark job that reads data from multiple tables sequentially (looping through each table), but I believe this is not an optimized way of reading the data. Is there any way to leverage the power of Spark to read from multiple tables in parallel?

 

Here is my pseudocode.

 

table_list = ['table1', 'table2', 'table3', 'table4']

dfs = {}
for table in table_list:
    # Reads run one after another; each iteration blocks
    # until the previous table has been loaded.
    dfs[table] = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()

 

Another challenge with the current solution is that reading data from a gigantic table is slow. I found a way to implement a parallel read using the partitionColumn option, but I'm not sure whether it only works with numeric (sequential) values.

https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html
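
For illustration, this is the kind of partitioned read I have in mind; the column name 'id', the table name, and the bound values below are placeholders:

# Hypothetical partitioned read: 'id', 'schema.big_table', and the bounds
# are placeholders. Spark opens numPartitions parallel JDBC connections,
# each scanning one slice of [lowerBound, upperBound] on partitionColumn.
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.big_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "10") \
    .load()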

 


Re: Read multiple tables in parallel using Spark

Cloudera Employee

Hello,

 

You can try setting 'spark.scheduler.mode' to 'FAIR':

conf.set("spark.scheduler.mode", "FAIR")

so that multiple jobs will be executed in parallel.
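
As a minimal sketch of that approach (connection details, table names, and the pool size are placeholders): enable the FAIR scheduler on the SparkConf, then use a driver-side thread pool to submit one read per table, so the resulting jobs can be scheduled concurrently.

from concurrent.futures import ThreadPoolExecutor

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Enable the FAIR scheduler before the session is created.
conf = SparkConf().set("spark.scheduler.mode", "FAIR")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

def read_table(table):
    # Each call runs in its own driver thread, so the jobs it
    # triggers can be interleaved by the FAIR scheduler.
    return spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()

table_list = ['table1', 'table2', 'table3', 'table4']
with ThreadPoolExecutor(max_workers=4) as pool:
    dfs = dict(zip(table_list, pool.map(read_table, table_list)))

Note that load() itself only fetches each table's schema; the heavy data movement happens once each DataFrame is used in an action, and those are the jobs the scheduler runs in parallel.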

 

Please refer to the documentation [1].

 

[1] https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
