I am new to Spark and would like suggestions on the best approach for my Spark program.
I have various HDFS directories, each receiving structured data (Parquet files) streamed from a different source, with a new file arriving every minute.
The goal is a table containing the data from these sources, joined using foreign keys.
Approach (please suggest improvements):
1. Use Spark Streaming to monitor the HDFS directories, receive the structured data, convert it to DataFrames, and store them as Hive tables.
2. Use one of the stream inputs as a trigger for a method that performs the join (using Spark SQL on the DataFrames) against the other Hive tables from #1 and stores the output to a Hive/HBase table. (A sketch of both steps appears after this list.)
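To make the approach concrete, here is a minimal sketch of both steps using Structured Streaming (the newer streaming API, which handles file sources like this natively). All paths, table names, and the `fk_id` join column are hypothetical placeholders, not part of any real setup:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamingJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingJoinSketch")
      .enableHiveSupport() // assumes a Hive metastore is configured
      .getOrCreate()

    // Step 1: monitor an HDFS directory; each new Parquet file is
    // picked up as a micro-batch. File sources require an explicit
    // schema, inferred here from files already present (assumption).
    val sourceA = spark.readStream
      .schema(spark.read.parquet("hdfs:///data/source_a").schema)
      .parquet("hdfs:///data/source_a")

    // Persist the raw stream to a queryable Parquet location.
    sourceA.writeStream
      .format("parquet")
      .option("path", "hdfs:///warehouse/source_a")
      .option("checkpointLocation", "hdfs:///checkpoints/source_a")
      .outputMode("append")
      .start()

    // Step 2: treat one stream as the trigger. For every micro-batch,
    // join it against the current contents of the other table(s) and
    // append the result to an output table.
    val sourceB = spark.readStream
      .schema(spark.read.parquet("hdfs:///data/source_b").schema)
      .parquet("hdfs:///data/source_b")

    sourceB.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Re-read so each batch sees the latest step-1 data.
        val dimA = spark.read.parquet("hdfs:///warehouse/source_a")
        batch.join(dimA, Seq("fk_id")) // hypothetical foreign-key column
          .write
          .mode("append")
          .saveAsTable("joined_output")
      }
      .option("checkpointLocation", "hdfs:///checkpoints/join")
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```

Note that the `foreachBatch` step re-reads the step-1 output on every micro-batch, which is exactly what my second question below is about.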
Is this the best way to join the related structured data, or is there a more efficient method?
Would there be overhead in the long run as the Hive tables from step #1 grow and step #2 reads the entire table every time to perform the join? Is there an efficient way to overcome this, for example by partitioning the tables as sketched below?
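To illustrate the concern and one common mitigation: if the step-1 tables are partitioned (for example by ingestion date), Spark can prune partitions at read time so the join only scans recent data instead of the whole table. A minimal sketch continuing the code above, with `ingest_date` as a hypothetical column added at write time:

```scala
import org.apache.spark.sql.functions.{col, current_date}

// Step 1 variant: partition the persisted stream by ingestion date.
sourceA
  .withColumn("ingest_date", current_date())
  .writeStream
  .format("parquet")
  .partitionBy("ingest_date")
  .option("path", "hdfs:///warehouse/source_a")
  .option("checkpointLocation", "hdfs:///checkpoints/source_a")
  .start()

// Step 2 variant: a filter on the partition column lets Spark skip
// all other partitions instead of scanning the entire table.
val recentA = spark.read
  .parquet("hdfs:///warehouse/source_a")
  .filter(col("ingest_date") === current_date())
```

Would this kind of partition pruning be enough, or is there a better pattern for this?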
Appreciate your time and thanks in advance.
I'm not entirely sure, but you may want to consider the following (each point independently of the others):