I'm very new to Big Data, and I'm currently working on a project in which a server rapidly generates a set of CSV files about our customers' activities. I'm thinking about doing some analytics to get reports and visualisations in real time, and at the same time I want to archive those files for a long period so we can get yearly insight into our customers' activities. I'm very confused about which architecture to use. Do you have any suggestions? I'm thinking about ingesting the files using Flume and Kafka, then Spark Streaming for real-time analysis, and using HDFS or Elasticsearch for batch processing. What do you think about this pipeline? Do you have other suggestions? Thanks in advance for your help.
I am not sure I follow your pipeline.
I get the impression that Flume is being effectively deprecated in favour of (the more complicated) NiFi/HDF.
Kafka is only worthwhile if you really have a LOT of events being generated at once, and I am not sure why you have added it. Is it just because you know how to do Kafka -> Spark Streaming?
HDFS is a file system for distributed storage. Elasticsearch is quite literally a search engine. It is great for questions like "tell me how many million records look like this" and "now, if I restrict it by this other criterion, how many do I get?". Neither is "for batch processing".
It sounds like you are trying to build a lambda architecture system. Have you found any reference architectures for that?
Try considering a lambda architecture. You can find a basic explanation of it in the link below.
Your tooling selection will really depend on your particular use case.
For the "Speed" layer, you can use Storm or Spark Streaming. IMHO the main selection criterion between the two is whether you're interested in ultra-low latency (Storm) or high throughput (Spark Streaming). There are other factors, but these are some of the main drivers.
For the "Serving" layer, your main choice is HBase. Depending on how you're going to query the "Serving" layer, you may want to consider putting Phoenix on top of HBase. Since HBase is a NoSQL store, it has its own API for making calls. Phoenix adds an abstraction layer on top of HBase and allows you to make queries in SQL format. Mind you, it's still in tech preview and may have some bugs here and there. Also, it's not meant for complex SQL queries.
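Just as a hedged illustration of what "queries in SQL format" means here (the table and column names below are made up for this sketch, not taken from any real deployment), a Phoenix query over an HBase-backed table looks like ordinary SQL:

```sql
-- Hypothetical table; Phoenix maps it onto an underlying HBase table.
CREATE TABLE IF NOT EXISTS driver_events (
    driver_id  BIGINT NOT NULL,
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR,
    CONSTRAINT pk PRIMARY KEY (driver_id, event_time)
);

-- A simple aggregate works fine; deeply nested/complex SQL is where
-- Phoenix's tech-preview limitations are more likely to show.
SELECT driver_id, COUNT(*) AS event_count
FROM driver_events
WHERE event_type = 'SPEEDING'
GROUP BY driver_id;
```

The same lookup against the native HBase API would require hand-written scans and filters, which is exactly the boilerplate Phoenix abstracts away.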
For your ingest and simple event processing you can look into HDF/NiFi.
If you move beyond the HDP/HDF stack for the serving layer then your options increase to include other NoSQL stores as well as regular SQL DBs.
Below is a diagram of a sample lambda architecture for a demo that receives sensor data from trucks and analyses it, along with driver behaviour, to determine the likelihood of a driver committing a traffic violation/infraction. It will give you a better idea of what a lambda deployment may look like.
As always, if you find this post useful, don't forget to accept the answer.
You are on the right track. You can configure your server to dump the CSV files into a shared directory. Then you set up a Flume agent with a spooling directory source and an Avro sink that links to Spark Streaming. Spark Streaming would read the Flume feed and do two things at the same time: (1) save the input data into Parquet files on HDFS for batch analytics; (2) process the feed for real-time enrichment or calculation. This architecture is simple, easy to implement, and works well. You can later use Hive to query the Parquet files offline. Your data is in CSV, which means it is structured and has a defined schema, so using Elasticsearch or Solr will not make a great contribution to your analytics. Also, if you wish, you can set up your visualization tools to query the Parquet files in near real time, which would suit your needs.
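To make the Flume part concrete, a minimal agent config for the spooling-directory-source-to-Avro-sink setup could look roughly like this (the agent name, directory, hostname, and port are placeholders, not values from this thread):

```properties
# Hypothetical Flume agent: spooling directory source -> memory channel -> Avro sink
agent1.sources = csv-source
agent1.channels = mem-channel
agent1.sinks = avro-sink

# Watch the shared directory where the server dumps CSV files
agent1.sources.csv-source.type = spooldir
agent1.sources.csv-source.spoolDir = /data/csv-spool
agent1.sources.csv-source.channels = mem-channel

# Buffer events in memory between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Push events to the host/port where the Spark Streaming
# Flume receiver (FlumeUtils.createStream) is listening
agent1.sinks.avro-sink.type = avro
agent1.sinks.avro-sink.hostname = spark-host
agent1.sinks.avro-sink.port = 4545
agent1.sinks.avro-sink.channel = mem-channel
```

On the Spark side, the DStream created from this feed is where you fork the two paths: one branch writes the batches out as Parquet on HDFS, the other runs your real-time enrichment/calculation.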