I was wondering how cross-component lineage is handled in Atlas. I do see the example where data is loaded from Kafka into Storm and then into HDFS. My question is: is it specified explicitly somewhere in the code which Kafka topic Storm read from and which HDFS directory it wrote to, or is this handled dynamically via some kind of event notifications? For example, if I read data from Kafka into Spark, will I have to specify this information in the code of the Spark application, thus changing the application?
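To make the question concrete, here is a rough sketch of the kind of entity payload I imagine an application might have to build and POST to Atlas's v2 REST entity endpoint to record Kafka-to-HDFS lineage itself. The typenames and attributes below are my assumptions, not taken from the Atlas source; I'm asking whether something like this is actually required:

```python
import json

# Hypothetical lineage payload: a Process entity connecting a Kafka topic
# to an HDFS directory. Typenames and attribute names are assumptions,
# not verified against the Atlas type system.
def build_lineage_entity(topic, hdfs_dir):
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": f"spark_app:{topic}->{hdfs_dir}",
                "name": "spark_app",
                "inputs": [{"typeName": "kafka_topic",
                            "uniqueAttributes": {"qualifiedName": topic}}],
                "outputs": [{"typeName": "hdfs_path",
                             "uniqueAttributes": {"qualifiedName": hdfs_dir}}],
            },
        }
    }

payload = build_lineage_entity("events", "/data/events")
# A real application would POST this JSON to the Atlas REST API;
# here we only print it to show the shape of the payload.
print(json.dumps(payload, indent=2))
```

If lineage instead comes from event notifications, none of this would need to live in the application code, which is what I'm hoping to confirm.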
The tutorial is at Cross Component Scripts, and it makes use of a JAR file. Is it possible to get the source code, or can someone point me to the code where this information is handled?
Spark has jobs, stages, and tasks. We can model this information and send it to Atlas via the hook to capture the details of what goes on inside Spark. What about Spark Streaming? Spark Streaming has the same structure, except that the jobs repeat every batch interval. Since streaming applications are long-running, sending this level of detail makes little sense: it could overwhelm the system, and much of it would be redundant. Any suggestions on how streaming information should be sent to Atlas, and what information should be included?
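One approach I've been considering (purely a sketch of the idea, with invented names, not any existing Atlas or Spark API) is to aggregate per-batch details on the application side and emit a single rolled-up update every N batches instead of one notification per micro-batch:

```python
# Sketch of client-side aggregation for a streaming hook: instead of
# notifying Atlas once per micro-batch, accumulate batch statistics and
# flush one summary every `flush_every` batches. All names here are
# hypothetical.
class StreamingLineageAggregator:
    def __init__(self, flush_every=100):
        self.flush_every = flush_every
        self.batches = 0
        self.records = 0
        self.summaries = []  # stand-in for "send summary entity to Atlas"

    def on_batch_completed(self, record_count):
        self.batches += 1
        self.records += record_count
        if self.batches % self.flush_every == 0:
            # One summary update in place of `flush_every` per-batch updates.
            self.summaries.append({
                "batchesProcessed": self.batches,
                "recordsProcessed": self.records,
            })

agg = StreamingLineageAggregator(flush_every=3)
for count in [10, 20, 30, 40]:
    agg.on_batch_completed(count)
print(agg.summaries)  # → [{'batchesProcessed': 3, 'recordsProcessed': 60}]
```

The lineage graph itself (sources and sinks) would still be registered once at application start; only the volume metrics would be batched this way. Does this match how others handle streaming, or is there a recommended pattern?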