Below is the scenario I need suggestions on:
Data ingestion is done through NiFi into Hive tables.
A Spark program then has to perform ETL operations and complex joins on the data in Hive.
Since the data ingested through NiFi arrives as a continuous stream, I would like the Spark jobs to run every 1 or 2 minutes on the newly ingested data.
Which is the best option to use?
How do we reduce the overhead and time lag of repeatedly submitting the job to the Spark cluster? Is there a better way to run a single program repeatedly?
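To make the "single program running repeatedly" idea concrete, here is a minimal sketch of what I have in mind: one long-lived driver process that loops on a timer instead of paying spark-submit startup cost every minute. This is plain Python, and `run_etl` is a hypothetical placeholder for the actual Spark/Hive batch logic:

```python
import time

def run_etl():
    # Hypothetical placeholder for the real ETL batch; in the actual
    # job this would run the Spark SQL joins against the Hive tables.
    print("running one ETL batch")

def run_periodically(interval_seconds=60, max_runs=None):
    """Run the ETL inside one long-lived process, so each cycle
    avoids the overhead of a fresh job submission."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        run_etl()
        runs += 1
        # Sleep only for whatever is left of the interval,
        # so batch duration doesn't push the schedule back.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_seconds - elapsed))
    return runs
```

Is this kind of driver-side loop a reasonable pattern, or does Spark offer something better suited for it?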
Can a Spark Streaming job be triggered automatically every minute to process the data from Hive?
Is there any other efficient mechanism to handle such a scenario?
Thanks in advance.