n3
Visitor
Posts: 0
Registered: ‎05-20-2016

Improvement proposal for my architecture

Hey, I want to build a big data environment but I'm not so familiar with all the great tools. Many tools are very similar, but the devil is in the details. Maybe you can help me validate my thoughts so I can start from a solid base. As an example, I want to connect the following data sources to Hadoop:

  • Twitter Stream
  • Chat
  • boards
  • ...

 

With a REST API I want to search for single words per stream or in all streams. There should also be the option to search the whole dataset or only the past 24h. The methods (use cases) could be:

  • findwordintwitter
  • findwordinchat
  • ...
  • findwordintwitter24h
  • findwordinchat24h
  • ...
  • findwordinallstreams
  • findwordinallstreams24h
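Instead of one method per stream and time window, a single parameterized endpoint keeps the API from growing with every new stream. A minimal sketch of that idea (the endpoint name `/search` and the parameters `stream` and `window` are my own assumptions, not an existing API):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stream names; extend this set as new sources are connected.
KNOWN_STREAMS = {"twitter", "chat", "boards"}

def build_search_query(word, stream="all", window=None):
    """Translate one REST call like
        GET /search?word=foo&stream=twitter&window=24h
    into a query description, covering findwordintwitter24h etc.
    with a single endpoint."""
    if stream != "all" and stream not in KNOWN_STREAMS:
        raise ValueError("unknown stream: %s" % stream)
    query = {
        "word": word,
        "streams": sorted(KNOWN_STREAMS) if stream == "all" else [stream],
    }
    if window == "24h":
        # restrict the scan to the last 24 hours
        query["since"] = datetime.now(timezone.utc) - timedelta(hours=24)
    return query

q = build_search_query("hadoop", stream="twitter", window="24h")
```

This way, `findwordinallstreams24h` is just `stream=all&window=24h`, and adding a new source means extending `KNOWN_STREAMS` rather than adding new endpoints.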

 

The idea was to use Flume, HBase and Knox. But is it really that simple? I have now swapped Flume for NiFi.

I need sub-second response times, so I will go with HBase. As of now, data will be pushed/pulled with NiFi and stored in HDFS. From HBase I can convert it into an ORC file and access it with Hive. So I get the speed of HBase and simple access with Hive over WebHCat. Am I right?
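Whether HBase alone gives sub-second answers depends mostly on the row-key design. One common pattern for "last 24h" lookups is a key of stream, word, and a reverse timestamp, so the newest entries of a word sort first and a scan can stop early. A rough sketch of that layout (the key format is my assumption, not a fixed HBase rule):

```python
MAX_TS = 10**13  # larger than any epoch-millis value we expect

def row_key(stream, word, ts_millis):
    """Compose stream#word#reverse-ts so all rows for one word in one
    stream are contiguous, sorted newest-first."""
    reverse_ts = MAX_TS - ts_millis
    return "%s#%s#%013d" % (stream, word, reverse_ts)

def scan_range(stream, word):
    """Start/stop keys for a prefix scan over all entries of a word."""
    prefix = "%s#%s#" % (stream, word)
    return prefix, prefix + "~"  # '~' sorts after any digit

k_new = row_key("twitter", "hadoop", 1_700_000_000_000)
k_old = row_key("twitter", "hadoop", 1_600_000_000_000)
# lexicographically, the newer entry sorts before the older one,
# so a 24h query only reads the top of the range
```

With such a key, the 24h variant of a search is the same scan with an early stop condition, rather than a separate table or pipeline.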

I need to read more about ORC files and Avro, because the difference is not really clear to me. I also read that Kafka would be a good tool to use. What do you think?

 

Knox will secure the incoming and outgoing connections. But I think I'm missing a lot, and it's not as simple as I make it sound. Maybe I need a pipeline like Kafka for each use case, or one HBase instance per stream. I am struck by the large number of tools, and I hope that someone can give me a hint about which tools I need. A little architecture overview with an explanation would be great, so I have something to build on.
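For the Knox part, access is defined in a topology file that maps gateway URLs to the cluster's REST services. A stripped-down sketch of what that could look like, assuming WebHBase and WebHCat are the services being exposed (host names, ports, and the file name are placeholders, and the authentication provider details are omitted):

```xml
<!-- e.g. {GATEWAY_HOME}/conf/topologies/bigdata.xml (name is illustrative) -->
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <!-- LDAP/Shiro parameters would go here -->
    </provider>
  </gateway>
  <service>
    <role>WEBHBASE</role>
    <url>http://hbase-rest-host:8080</url>
  </service>
  <service>
    <role>WEBHCAT</role>
    <url>http://webhcat-host:50111/templeton</url>
  </service>
</topology>
```

Clients then talk only to the gateway URL, and Knox proxies the calls to HBase REST and WebHCat behind it.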

 

thanks, Jan

Cloudera Employee
Posts: 39
Registered: ‎01-07-2019

Re: Improvement proposal for my architecture

It is hard to be sure that you cover the entire use case as long as you are only working 'on paper'.

 

Assuming you have the ability to use the Cloudera stack, I would recommend going step by step, starting at the beginning and adding tools as you need them.

 

A typical 'journey' could look like this.

 

1. Start by ingesting with NiFi

2. a. Offload to HBase for rapid access, or HDFS/Kudu for analytical batch access

    b. If needed: publish to Kafka and let a streaming engine like Spark listen to it
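Step 2b is essentially a continuous filter over the Kafka messages. Before wiring up Spark, the core logic can be sketched in plain Python over an in-memory list of records (the message fields `stream` and `text` are assumptions about the data format):

```python
def match_word(message, word):
    """True if the searched word occurs in the message text.
    A streaming engine like Spark would apply this check per record."""
    return word.lower() in message.get("text", "").lower()

def filter_stream(messages, word, stream=None):
    """Lazily yield matching messages, optionally for one stream only,
    mimicking what a Kafka consumer plus a streaming job would do."""
    for msg in messages:
        if stream is not None and msg.get("stream") != stream:
            continue
        if match_word(msg, word):
            yield msg

incoming = [
    {"stream": "twitter", "text": "Learning Hadoop today"},
    {"stream": "chat", "text": "hello world"},
    {"stream": "chat", "text": "hadoop question"},
]
hits = list(filter_stream(incoming, "hadoop"))  # first and third record match
```

In the real pipeline, `incoming` would be the Kafka topic and the matches would be written to HBase or served directly, but the per-record logic stays this simple.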

 

After these steps, you should be clear on what is still missing, and you can add more tools where needed.