SV1
New Contributor
Posts: 1
Registered: 06-29-2018

How-to: Architect an NRT streaming Cloudera architecture for MUTABLE Data

Hello,

 

I came across the following blog post on Cloudera that walks very well through a possible architecture for log analytics. From what I understand, the key point there is that it is logs/clickstream data, in other words IMMUTABLE data: the data is pure streaming and would NOT get any updates. Here is the link to the blog: https://blog.cloudera.com/blog/2017/03/how-to-log-analytics-with-solr-spark-opentsdb-and-grafana/

 

That said, I am trying to learn and better understand the architecture approach that Cloudera, or anyone who has practically implemented something like the below, would recommend:

 

1. The data consumer will be a front-end web application interface

2. This front-end web application will use the standard Solr API to access the data (see the sketch after this list)

3. Solr documents will be stored within CDH

4. The web app requires the data to be NRT (ideally < 1 sec latency)

5. The data sources are multiple RDBMS databases

6. Data elements within certain subject areas are MUTABLE (i.e. data can change)

7. Change data should flow from the RDBMS databases into the Solr documents (ideally < 1 sec latency)

8. Solr will hold the past 3 years of data, and searches for selected data elements will be done using Solr

9. HBase will hold the past 10 years of data (or since inception), and searches can be done against it to retrieve ALL data elements
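
For requirement 2, here is a minimal sketch of what the web-application side could look like using SolrJ, the standard Solr Java client. The Solr host, collection name, and field names are hypothetical placeholders (and on older CDH/Solr 4.x releases the client class would be HttpSolrServer rather than HttpSolrClient):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class WebAppSolrLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection; adjust to the actual Cloudera Search setup
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr-host:8983/solr/customer_events").build()) {

            SolrQuery q = new SolrQuery("account_id:12345");           // search on a selected data element
            q.addFilterQuery("event_time:[NOW-3YEARS TO NOW]");        // Solr only holds the past 3 years
            q.setFields("id", "account_id", "status", "event_time");
            q.setRows(50);

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc);                               // the web app would render this instead
            }
        }
    }
}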

 

I am trying to determine what a possible architecture approach for something like this would be. Referring to the blog on log analytics, I am not sure I can take a similar approach given that the data is MUTABLE. Is that right?

 

Here is what I was thinking of as possible architecture approaches, but I am looking for suggestions and recommendations on how to optimize them.

 

Approach 1:

Source RDBMS -> Q Replication Change Data Capture (CDC) -> Flume -> Kafka -> HBase -> Spark -> Solr -> Standard Solr API -> Web Application

 

NOTE: In the above approach, Solr relies on HBase, so there is another data hop and it is a bit different from the blog. HBase is introduced because it holds the entire data set, which lets the Solr documents be UPDATED effectively by pulling the required data from HBase.
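
To make the extra hop concrete, below is a rough sketch (assumptions, not a tested implementation) of the per-event merge step that the Spark stage in Approach 1 could perform for each CDC record from Kafka: read the full current row from HBase, apply the changed column, write it back, and re-index the corresponding Solr document. The table, column family, and field names are made up, and the plain HBase/SolrJ client calls stand in for whatever the Spark job would actually use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CdcMergeStep {

    private static final byte[] CF = Bytes.toBytes("d");   // hypothetical column family

    // Called once per CDC event: rowKey identifies the entity, changedCol/newValue is the delta
    static void applyChange(Table table, SolrClient solr,
                            String rowKey, String changedCol, String newValue) throws Exception {
        // 1. Read the full current row from HBase (HBase keeps the complete state/history)
        Result current = table.get(new Get(Bytes.toBytes(rowKey)));

        // 2. Write the changed column back so HBase stays up to date
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(CF, Bytes.toBytes(changedCol), Bytes.toBytes(newValue));
        table.put(put);

        // 3. Rebuild the Solr document from the old columns plus the new value, then re-index it
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rowKey);
        doc.addField(changedCol, newValue);
        if (current.listCells() != null) {
            current.listCells().forEach(cell -> {
                String col = Bytes.toString(cell.getQualifierArray(),
                        cell.getQualifierOffset(), cell.getQualifierLength());
                if (!col.equals(changedCol) && !col.equals("id")) {
                    doc.addField(col, Bytes.toString(cell.getValueArray(),
                            cell.getValueOffset(), cell.getValueLength()));
                }
            });
        }
        solr.add(doc);   // NRT visibility then comes from Solr soft commits (autoSoftCommit)
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"));
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://solr-host:8983/solr/customer_events").build()) {
            applyChange(table, solr, "cust-12345", "status", "CLOSED");
        }
    }
}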

 

Approach 2 (more like what's in the referenced blog post):

Source RDBMS -> Q Replication Change Data Capture (CDC) -> Flume -> Kafka -> Spark -> Solr -> Standard Solr API -> Web Application

AND

Source RDBMS -> Q Replication Change Data Capture (CDC) -> Flume -> Kafka -> HBase

 

NOTE: In the above approach, how would Solr handle UPDATES? Would it have to go back to the source to get the related information, or can it use its previous state of the document and apply the updates?
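
On the question in the note above: Solr can apply partial ("atomic") updates against its own stored copy of a document, as long as the relevant fields are stored (or use docValues) and the update log is enabled, so it does not necessarily have to go back to the source for the unchanged fields. A minimal SolrJ sketch, with hypothetical field names:

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrAtomicUpdate {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr-host:8983/solr/customer_events").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "cust-12345");                                   // key of the existing document
            doc.addField("status", Collections.singletonMap("set", "CLOSED"));  // atomic "set" of one field

            solr.add(doc);   // Solr merges this into its stored copy; no round trip to the RDBMS
        }
    }
}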

 

Approach 3:

Is there a better approach, with fewer data hops, that still meets the requirements?

 
