Improvement proposal for my architecture


Hey, I want to build a big data environment, but I'm not so familiar with all the great tools. Many tools are very similar, but the devil is in the details. Maybe you can help me validate my thoughts so I can start from a solid base. As an example, I want to connect the following data sources to Hadoop:

  • Twitter Stream
  • Chat
  • Boards
  • ...

 

Via a REST API I want to search for single words per stream or across all streams. There should also be an option to search the whole dataset or only the past 24 hours. The methods (use cases) could be (see the sketch after this list):

  • findwordintwitter
  • findwordinchat
  • ...
  • findwordintwitter24h
  • findwordinchat24h
  • ...
  • findwordinallstreams
  • findwordinallstreams24h
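
To make this concrete, here is a rough sketch of how I imagine the 24h lookups could work in HBase, assuming a table named words and a row key of the form stream#word#timestamp (the table name and key layout are just my assumptions, and I am using the HBase 2.x client API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WordLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("words"))) {
                long now = System.currentTimeMillis();
                long dayAgo = now - 24L * 60 * 60 * 1000;
                // findwordintwitter24h: with a stream#word#timestamp key,
                // the 24h variant becomes a simple range scan. Epoch millis
                // stay 13 digits long for centuries, so their string form
                // sorts correctly here.
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("twitter#hadoop#" + dayAgo))
                        .withStopRow(Bytes.toBytes("twitter#hadoop#" + now));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }

I guess findwordinallstreams would then need either one scan per stream or a second table keyed by word first; that is exactly the kind of design decision I am unsure about.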

 

My original idea was to use Flume, HBase and Knox. But is it really that simple? By now I have swapped Flume for NiFi.

I need sub-second response times, so I will go with HBase. So far the plan is: data is pushed/pulled with NiFi and stored in HDFS. From HBase I can convert the data into ORC files and access them with Hive. That way I get the speed of HBase and simple access through Hive over WebHCat. Am I right?
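
For the Hive side, I picture access roughly like this (only a sketch: the HiveServer2 host and the tweets table are made up by me, and I use the plain Hive JDBC driver here instead of WebHCat):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveWordCount {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 endpoint; "tweets" is an assumed
            // ORC-backed table over the data that NiFi lands in HDFS.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "jan", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT count(*) FROM tweets WHERE text LIKE '%hadoop%'")) {
                while (rs.next()) {
                    System.out.println("matches in full dataset: " + rs.getLong(1));
                }
            }
        }
    }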

I need to read more about ORC and Avro, because the difference is not really clear to me. I have also read that Kafka would be a good tool to use. What do you think?
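
If Kafka fits in, I imagine the ingest side roughly like this, with one topic per source stream (the broker address and topic names are my assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class StreamPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // One topic per source stream, so downstream consumers
                // (NiFi, an HBase writer, ...) can subscribe independently.
                producer.send(new ProducerRecord<>("twitter", "user42",
                        "example tweet mentioning hadoop"));
                producer.send(new ProducerRecord<>("chat", "room7",
                        "example chat line"));
            }
        }
    }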

 

Knox will secure the incoming and outgoing connections. But I suspect I am missing a lot and that it is not as simple as I describe. Maybe I need a pipeline like Kafka for each use case, or one HBase instance per stream. I am overwhelmed by the large number of tools, and I hope someone can give me a hint about which ones I actually need. A little architecture overview with an explanation would be great, so I have something to build on.
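
For Knox, my understanding is that clients would only ever talk to the gateway, for example like this (host, topology name and credentials are placeholders; I also assume the gateway's TLS certificate is already trusted by the JVM):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class KnoxClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical gateway address; Knox exposes services under
            // https://<host>:8443/gateway/<topology>/<service>.
            URL url = new URL(
                    "https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder()
                    .encodeToString("jan:secret".getBytes());
            conn.setRequestProperty("Authorization", "Basic " + auth);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }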

 

Thanks, Jan
