
Looping in spark

Explorer

Dear Folks,


Note: it's an urgent requirement; your suggestions will be appreciated.


Here is my requirement: it's a batch job.

For every run, I have to load five Hive tables.

I created separate dataframe objects for all five tables and call them inside the main function.

I am using flags to get the user input and start running the job.

The code below works fine for two tables.


If I pass the argument "all", all five tables get loaded without issue.

Sometimes, on an ad-hoc basis, I may need to load only two or three tables, depending on the requirement.

How can I achieve this in my code?


E.g.:

For today's run, I need to load only three tables.

I passed the table names as arguments while submitting the job:


spark-submit --class .. --master yarn eimreporting tableA tableB tableC


code:

----

import org.apache.spark.{SparkConf, SparkContext}

object Medinsight_Main {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("Eim_Reporting")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    try {
      if (args(0).toLowerCase() == "eimreporting" && args(1).toLowerCase() == "all") {
        eiminsight_claim_agg.Transform(sqlContext) // calling TableA object
        eiminsight_member.Transform(sqlContext)    // calling TableB object
      }
      else if (args(0).toLowerCase() == "eimreporting" && args(1).toLowerCase() == "tablea") {
        // note: args(1) is lower-cased, so the literal must be lower case too
        eiminsight_claim_agg.Transform(sqlContext) // calling TableA object
      }
      else if (args(0).toLowerCase() == "eimreporting" && args(1).toLowerCase() == "tableb") {
        eiminsight_member.Transform(sqlContext)    // calling TableB object
      }
      else {
        System.out.println("No arguments")
      }
    }
    catch {
      case e: Exception => e.printStackTrace()
    }
    finally {
      sc.stop()
    }
  }
}


Re: Looping in spark

Super Collaborator

I am not sure if this answers your question, but one important thing to note is that Spark uses lazy evaluation.

 

As such, if you have load commands or transformations, Spark will ignore these until you actually 'do' something with the data.

 

Examples of 'doing something' could be printing or saving the data.
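For instance, a minimal sketch of lazy evaluation, assuming a HiveContext named sqlContext and a hypothetical Hive table some_db.tableA:

// This only builds a logical plan (transformations); nothing is read from Hive yet.
val df = sqlContext.table("some_db.tableA").filter("claim_year = 2017")

// Only when an action such as count(), show() or a save is invoked does Spark
// actually load and process the data.
println(df.count())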

 

As such, simply writing a Spark function in which your input controls what you actually do with the data should prevent irrelevant datasets from being loaded.
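A minimal sketch of that idea, reusing the objects from the question (assuming each exposes a Transform(sqlContext) method) and a hypothetical lookup map from table name to loader, so only the tables passed on the command line are processed:

// Hypothetical mapping from the table-name argument to the object that loads it.
val loaders: Map[String, org.apache.spark.sql.hive.HiveContext => Unit] = Map(
  "tablea" -> (ctx => eiminsight_claim_agg.Transform(ctx)),
  "tableb" -> (ctx => eiminsight_member.Transform(ctx))
  // ... add entries for the remaining three tables
)

// args(0) is "eimreporting"; everything after it is a table name, or "all".
val requested =
  if (args.length > 1 && args(1).toLowerCase == "all") loaders.keys.toSeq
  else args.drop(1).map(_.toLowerCase).toSeq

requested.foreach { name =>
  loaders.get(name) match {
    case Some(load) => load(sqlContext)
    case None       => println(s"Unknown table: $name")
  }
}

With something like this, spark-submit ... eimreporting tableA tableC would run only the loaders for those two tables, and adding a sixth table becomes a one-line change to the map. This is only a sketch; the map contents are assumptions based on the snippet in the question.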

 

----

Sidenote: in general, the recommended course of action for urgent functional questions is to reach out to the account team.


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or, if it is valuable for future readers, please apply 'kudos'. Also check out my technical portfolio at https://portfolio.jaheruddin.nl