@spring I would suggest going with a lambda architecture, since its purpose is to cover both streaming and batch processing. Choosing the tools for the batch and streaming layers is tricky; it depends on the volume of data to be ingested, the source the data is captured from, the frequency at which processing has to be triggered, and the nature of the data (structured, semi-structured, or unstructured). The Kafka and Spark combination works well for a lambda architecture. Have a look at this link, which explains it with a scenario: https://dzone.com/articles/in-memory-mapreduce-and-your-hadoop-ecosystem-part Spark serves near-real-time processing of data and also holds up well for batch processing, so as a developer you do not have to concentrate much on other areas/tools in the Hadoop ecosystem. Of course, it may not serve your purpose if your needs differ from what Spark can do. Whatever tools you choose, there will be some trade-off that has to be handled in one way or another. If you can share a few more details of your requirements, some other points can be mentioned. Hope it helps.
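To make the Kafka + Spark pairing more concrete, here is a minimal sketch of the speed layer of a lambda architecture, assuming Spark 2.x with the spark-streaming-kafka-0-10 integration. The broker address, topic name ("events"), and the per-batch count are illustrative assumptions, not part of the original answer.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SpeedLayer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("speed-layer")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "speed-layer",
      "auto.offset.reset" -> "latest"
    )

    // Read from a hypothetical "events" topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Near-real-time view: count incoming events per micro-batch
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The batch layer would be a separate Spark job over the same data landed on HDFS; the trade-off mentioned above shows up here as maintaining two code paths over the same events.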
You are on the right track. You can configure your server to dump the CSV files onto a shared directory, then set up a Flume agent with a spooling directory source and an Avro sink that links to Spark Streaming. The Spark Streaming job reads the Flume feed and does two things at the same time: (1) saves the input data into Parquet files on HDFS for batch analytics; (2) processes the feed for real-time enrichment or calculation. This architecture is simple, easy to implement, and works well. You can later use Hive to query the Parquet files for offline queries. Your data is in CSV, which means it is structured and has a defined schema, so using Elasticsearch or Solr will not make a great contribution to analytics. Also, if you wish, you can set up your visualization tools to query the Parquet files in near real time, which would suit your needs.
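Here is a minimal sketch of the Spark Streaming side of that pipeline, assuming the spark-streaming-flume (push-based Avro sink) integration and Spark 2.x. The host/port, HDFS output path, and the trivial per-batch count are assumptions for illustration; the real enrichment logic and schema would depend on your CSV columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object CsvIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-ingest").getOrCreate()
    import spark.implicits._
    val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

    // Flume's Avro sink pushes events here; host/port must match the agent config
    val flumeStream = FlumeUtils.createStream(ssc, "spark-host", 41414)

    // Each Flume event body is one CSV line picked up from the spooled directory
    val lines = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // (1) Batch side: append the raw lines as Parquet on HDFS for Hive/offline queries
        rdd.toDF("line").write.mode("append")
          .parquet("hdfs:///data/raw/events")      // assumed target path

        // (2) Real-time side: a trivial placeholder calculation per micro-batch
        println(s"records in this batch: ${rdd.count()}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On the Flume side you would pair this with an agent whose source type is spooldir (watching the shared CSV directory) and whose sink type is avro pointing at the same host and port used in createStream above.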