I'm very new to Big Data, and I'm currently working on a project in which we have a server that very rapidly generates a set of CSV files about our customers' activities. I'm thinking about doing some analytics to get reports and visualizations in real time; at the same time, I want to archive those files for a long period to get insights about our customers' activities year over year.
I'm very confused about which architecture to use. Do you have any suggestions?
I'm thinking about ingesting the files using Flume and Kafka, then Spark Streaming for real-time analysis, and using HDFS or Elasticsearch for batch processing.
What do you think about this pipeline? Do you have other suggestions?
Waiting for your help.
If the CSV files are sent to a directory, I'd consider using HDF to pick them up and flow them directly into HDP. Once there you have a number of options. You can try using the new LLAP dynamic text cache https://hortonworks.com/blog/top-5-performance-boosters-with-apache-hive-llap/ to query them directly or convert them to ORC tables.
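Just to make the pick-up step concrete, here's a minimal pure-Python sketch of what HDF/NiFi would handle for you when it watches a landing directory for new CSV files (the directory name and the "track what we've seen" approach are made up for illustration; NiFi does this far more robustly):

```python
import csv
from pathlib import Path

def pick_up_csv_files(landing_dir, seen):
    """Scan a landing directory for CSV files not processed yet --
    a toy stand-in for an HDF/NiFi directory-pickup flow."""
    rows = []
    for path in sorted(Path(landing_dir).glob("*.csv")):
        if path.name in seen:
            continue  # already ingested on a previous scan
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
        seen.add(path.name)  # remember it so we don't re-read the same file
    return rows
```

In a real deployment the rows would flow on into HDP (Hive/ORC) rather than a Python list, but the shape of the step is the same: detect new files, read them once, move on.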
You can also move them into Elasticsearch or Solr and create dashboards. Once the files are in Hive, though, you can use any visualization tool you want via ODBC or JDBC connections.
Hope this helps.
Consider a lambda architecture. You can find a basic explanation of it in the link below.
Your tooling selection will really depend on your particular use case.
For the "Speed" layer, you can use Storm or Spark Streaming. IMHO the main selection criterion between the two will be whether you're interested in ultra-low latency (Storm) or high throughput (Spark Streaming). There are other factors, but these are some of the main drivers.
For the "Serving" layer, your main choice is HBase. Depending on how you're going to query the "Serving" layer, you may want to consider putting Phoenix on top of HBase. Since HBase is a NoSQL store, it has its own API for making calls. Phoenix adds an abstraction layer on top of HBase and allows you to make queries in SQL format. Mind you, it's still in tech preview and may have some bugs here and there. Also, it's not meant for complex SQL queries.
For your ingest and simple event processing you can look into HDF/Nifi.
If you move beyond the HDP/HDF stack for the serving layer then your options increase to include other NoSQL stores as well as regular SQL DBs.
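To make the batch/speed split concrete, here's a toy Python sketch (the view names and counts are invented for illustration, not from any real deployment) of what a serving-layer query conceptually does in a lambda architecture: merge the batch view, precomputed from the archive, with the incremental view the speed layer has built since the last batch run:

```python
from collections import Counter

def merge_views(batch_view, speed_view):
    """Serving-layer query in a lambda architecture: combine the
    precomputed batch view with the incremental real-time view."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # add counts for events seen since the last batch run
    return dict(merged)

# e.g. totals recomputed nightly from the archive, plus today's live events:
batch_view = {"login": 100, "purchase": 40}
speed_view = {"login": 3, "signup": 1}
print(merge_views(batch_view, speed_view))
```

In HDP terms, the batch view would live in Hive/HBase and the speed view would be maintained by Storm or Spark Streaming; the merge is what your dashboard query performs.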
Below is a diagram of a sample Lambda architecture for a demo that receives sensor data from trucks and analyzes it, along with driver behavior, to determine the likelihood of a driver committing a traffic violation/infraction. It will give you a better idea of what a lambda deployment may look like.
As always, if you find this post useful, don't forget to accept the answer.
I would say go with a lambda architecture, as its purpose is to serve both streaming and batch processing.
Choosing the tools for batch and streaming is tricky; it depends on the volume of data to be ingested, the source the data is captured from, how frequently the processing has to be triggered, and the nature of the data (structured, semi-structured, or unstructured).
A Kafka and Spark combination will work well for a lambda architecture.
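Conceptually, Spark Streaming chops the incoming stream into micro-batches and runs the same kind of computation on each one. Here's a tiny stdlib sketch of that windowed-aggregation idea (the event shape and the 60-second window are invented for illustration; a real job would express this with the Spark API over a Kafka source):

```python
from collections import defaultdict

def windowed_counts(events, window_seconds=60):
    """Group (timestamp, action) events into fixed time windows and
    count actions per window -- the style of aggregation a Spark
    Streaming micro-batch job would compute continuously."""
    counts = defaultdict(lambda: defaultdict(int))
    for ts, action in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[window_start][action] += 1
    return {window: dict(c) for window, c in counts.items()}
```

The same aggregation logic can also be run over the full archive in batch mode, which is part of why Spark is convenient for both layers of a lambda architecture.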
Have a look at this link, which explains it with a scenario.
Spark serves near-real-time processing of data and holds up well for batch processing too. As a developer, you don't have to concentrate much on other areas/tools in Hadoop. Of course, it may not serve your purpose if your needs differ from what Spark can do. Whichever tools you choose, there will be some trade-offs that have to be handled one way or another. If you could share a few more details about your needs, some other points could be mentioned. Hope it helps.