Spark Streaming on multiple input data sets and apply join through SQL

New Contributor

I am new to Spark and would like suggestions on the best approach for my Spark program.

Input:
Several HDFS directories, each receiving structured data (Parquet files) streamed from a different source, with a new file arriving every minute.

Output:
A table containing the data from the various sources above, joined on foreign keys.

Approach (please suggest):
1. Use Spark Streaming to monitor the HDFS directories, receive the structured data, convert it to DataFrames, and store them as Hive tables (a rough sketch follows below).
2. Use one of the stream inputs to trigger a method that performs the join (using Spark SQL on the DataFrames) against the other Hive tables from #1 and stores the output to a Hive/HBase table.
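
For reference, here is a rough sketch of what I have in mind for step 1, using Structured Streaming's file source; the paths, schema, and table names below are placeholders, not my real ones:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, LongType, DoubleType

spark = (SparkSession.builder
         .appName("hdfs-parquet-streams")
         .enableHiveSupport()
         .getOrCreate())

# Streaming file sources require an explicit schema (placeholder columns).
source_a_schema = (StructType()
                   .add("order_id", LongType())
                   .add("customer_fk", LongType())
                   .add("amount", DoubleType()))

# Monitor an HDFS directory where a new Parquet file lands every minute.
source_a = (spark.readStream
            .schema(source_a_schema)
            .parquet("hdfs:///data/source_a"))          # placeholder path

# Persist each micro-batch as Parquet; a Hive external table can be defined
# over this path so that step 2 can query it with Spark SQL.
query_a = (source_a.writeStream
           .format("parquet")
           .option("path", "hdfs:///warehouse/staging/source_a")         # placeholder
           .option("checkpointLocation", "hdfs:///checkpoints/source_a")
           .trigger(processingTime="1 minute")
           .start())

query_a.awaitTermination()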

Is this the best way to join the related structured data, or is there a more efficient method?
Would there be overhead in the long run as the Hive tables from step #1 grow bigger and step #2 reads the entire tables every time to perform the join? Is there an efficient way to overcome this?

Appreciate your time and thanks in advance.

1 REPLY

Re: Spark Streaming on multiple input data sets and apply join through SQL

Expert Contributor

@Canada Love150

I'm not very sure, but you may consider the following options (each independent of the other):

  • Join the DataFrames directly, without storing them as Hive tables, if your scenario permits that, and then store the result as a Hive table.
  • I believe you will be using either createTempView or registerTempTable before joining a DataFrame to an already existing table in Hive? You can optimize the joins by using map-side joins, as described in the links below, since the new data will be smaller than the consolidated data and can be kept in memory. You can join multiple tables in the same query, since Hive map joins require all tables but one to be small enough to keep in memory (if you want all the data in one consolidated table). Alternatively, you can do multiple map-side joins on different tables. A rough sketch follows after the links.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-Examples

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
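
To illustrate the idea (only a sketch, and all table, view, and column names here are made up): in Spark SQL the analogue of a Hive map-side join is a broadcast join, so you can hint the small, newly arrived data to be broadcast while joining it against the large consolidated Hive table.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stream-join-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Pretend this is the small batch of newly arrived data (placeholder table name).
new_batch = spark.table("staging.source_a_latest")

# Expose it to SQL; registerTempTable is the older name for the same idea.
new_batch.createOrReplaceTempView("new_batch")

# Broadcast the small side so the join happens map-side, without shuffling
# the large consolidated Hive table (placeholder names throughout).
joined = spark.sql("""
    SELECT /*+ BROADCAST(nb) */
           c.customer_id,
           c.customer_name,
           nb.order_id,
           nb.amount
    FROM   warehouse.customers c
    JOIN   new_batch nb
      ON   nb.customer_fk = c.customer_id
""")

# Store the joined result back into Hive (or write to HBase instead).
joined.write.mode("append").saveAsTable("warehouse.orders_enriched")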
