Analysis of real-time streaming data

This is a relatively broad question. I am aware of the tools I would probably need for a problem like this (for example Spark, Kafka and Hadoop), but I am looking for a concrete vision from an experienced professional's perspective.

Here's what the problem at hand looks like: we are using a Google Analytics-like service, which sends us a stream of events. An event is an action performed on the page. It could be a click on a button, a mouse movement, a page scroll, or a custom event defined by us. Part of one such event:

    "browser_string":"Chrome 47.0.2526",
    "os":"Mac OS X",
    "city":"Palma De Mallorca",
    "region":"Islas Baleares",
    "ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",

We now need to build a solution to analyse this data: a reporting platform that can aggregate, filter, and slice and dice it.

One example of a report we need to build: show me all the users who are coming from the United States and are using the Chrome browser on an iPhone.

Another example: show me the sum of clicks on a particular button for all the users who are coming from referrer = “”, are based in India, and are using a desktop.
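
To pin down what these two reports compute, here is a minimal sketch over a handful of in-memory events. Only `browser_string` and `os` appear in the event fragment above; `user_id`, `country`, `device`, `referrer`, `event_type` and `target` are assumed field names, the sample values (including `example.com`) are made up, and both function names are hypothetical.

```python
# Hypothetical sample events. Only "browser_string" appears in the real event
# fragment; every other field name and value here is an assumption.
EVENTS = [
    {"user_id": "u1", "country": "United States", "browser_string": "Chrome 47.0.2526",
     "device": "iPhone", "referrer": "example.com", "event_type": "page_scroll", "target": None},
    {"user_id": "u2", "country": "India", "browser_string": "Chrome 47.0.2526",
     "device": "Desktop", "referrer": "example.com", "event_type": "click", "target": "signup"},
    {"user_id": "u2", "country": "India", "browser_string": "Chrome 47.0.2526",
     "device": "Desktop", "referrer": "example.com", "event_type": "click", "target": "signup"},
    {"user_id": "u3", "country": "India", "browser_string": "Firefox 43.0",
     "device": "Desktop", "referrer": "other.example", "event_type": "click", "target": "signup"},
]


def users_matching(events, country, browser, device):
    """Return the distinct users with at least one event matching all filters."""
    return sorted({
        e["user_id"]
        for e in events
        if e["country"] == country
        and e["browser_string"].startswith(browser)
        and e["device"] == device
    })


def click_count(events, target, referrer, country, device):
    """Count click events on one button, restricted by referrer, country and device."""
    return sum(
        1
        for e in events
        if e["event_type"] == "click"
        and e["target"] == target
        and e["referrer"] == referrer
        and e["country"] == country
        and e["device"] == device
    )


print(users_matching(EVENTS, "United States", "Chrome", "iPhone"))
print(click_count(EVENTS, "signup", "example.com", "India", "Desktop"))
```

At the volumes described, these filters would run inside a query engine over stored events rather than over a Python list; the sketch only shows the shape of the filter-then-aggregate logic.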

This service sends out millions of such events, amounting to gigabytes of data per day. Here are the specific doubts I have:

* How should we store this huge amount of data?

* How can we analyse the data in real time?

* How should the query system work here? (I am relatively clueless about this part.)

* We estimate that about 4 TB of data will accumulate over 3 months. What should the strategy be for retaining this data, and when and how should we delete it?
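
To make the retention question concrete, here is a minimal sketch of one possible policy. It assumes events are stored in one partition per calendar day (nothing above fixes the storage layout) and that "3 months" means a 90-day window. The function names are hypothetical, and the volume estimate uses 1 TB = 1000 GB.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # assumption: "3 months" taken as a 90-day window


def expired_partitions(partition_days, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_days if d < cutoff)


def average_daily_volume_gb(total_tb, days=RETENTION_DAYS):
    """Rough average ingest per day, using 1 TB = 1000 GB."""
    return total_tb * 1000 / days


# Example: decide which day partitions a nightly cleanup job would drop.
partitions = [date(2016, 1, 1), date(2016, 1, 2), date(2016, 3, 31)]
print(expired_partitions(partitions, today=date(2016, 4, 1)))
print(round(average_daily_volume_gb(4), 1), "GB/day on average")
```

With day-level partitions, deleting old data means dropping whole partitions on a schedule instead of deleting individual events, which is why the sketch works on dates and not on rows.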