Hello,
I've created a Spark job that reads data from multiple tables sequentially (looping through each table), but I believe this is not an optimised way of reading the data. I'm wondering: is there any way to leverage the power of Spark to read data from multiple tables in parallel?
Here is my pseudo script.
table_list = ['table1', 'table2', 'table3', 'table4']

jdbc_dfs = {}
for table in table_list:
    # Read each table one at a time (sequential)
    jdbc_dfs[table] = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()
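One idea I was considering (not sure if it is the idiomatic approach) is to submit each table's read from a separate driver thread so the Spark scheduler can run the JDBC scans concurrently. This is only a rough sketch: it reuses spark and table_list from above, and the output path and max_workers are just placeholders.

from concurrent.futures import ThreadPoolExecutor

def read_table(table):
    # Build the DataFrame lazily; the actual scan happens when an action runs
    return spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql:dbserver") \
        .option("dbtable", "schema.{}".format(table)) \
        .option("user", "username") \
        .option("password", "password") \
        .load()

def read_and_store(table):
    # Trigger the read with an action (hypothetical output path)
    read_table(table).write.mode("overwrite").parquet("/tmp/output/{}".format(table))

# One driver thread per table, so several JDBC scans can run at the same time
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(read_and_store, table_list))

Would something like this work, or is there a better built-in way?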
Another challenge with the current solution is that reading data from a gigantic table is slow. I found a way to implement a parallel read using partitionColumn, but I'm not sure whether it only works with numeric (sequential) values. A rough sketch of what I had in mind is below the link.
partitionColumn: https://spark.apache.org/docs/2.0.0/api/R/read.jdbc.html
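For the gigantic table, this is roughly what I was planning to try with partitionColumn. The column name, bounds, and partition count are placeholders; I'd need to supply the real min/max of the chosen column.

big_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.big_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()

Does partitionColumn have to be a numeric column, or can other column types be used?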