
Improvement proposal for my architecture



I want to build a big data environment, but I'm not so familiar with all the available tools. Many tools are very similar, but the devil is in the details. Maybe you can help me validate my thoughts so I can start from a solid base.

I want to connect the following data resources to hadoop:

  • Twitter Stream
  • Chat
  • boards
  • ...

With a REST API I want to search for single words per stream or across all streams. There should also be the option to search the whole dataset or only the past 24 hours. The methods (use cases) could be:

  • findwordintwitter
  • findwordinchat
  • ...
  • findwordintwitter24h
  • findwordinchat24h
  • ...
  • findwordinallstreams
  • findwordinallstreams24h
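Rather than one endpoint per use case, the list above could collapse into a single parameterized lookup. A minimal sketch, assuming a Hive-style `messages` table with `stream`, `ts`, and `message` columns — all names here are hypothetical, not from any real API:

```python
# Hypothetical sketch: collapse the per-stream/per-window endpoints into
# one parameterized query builder. Table and column names are assumptions.
KNOWN_STREAMS = ("twitter", "chat", "boards")

def build_query(word, stream="all", last_24h=False):
    """Return a Hive-style query string for one of the use cases above."""
    if stream != "all" and stream not in KNOWN_STREAMS:
        raise ValueError(f"unknown stream: {stream}")
    streams = KNOWN_STREAMS if stream == "all" else (stream,)
    where = [f"message LIKE '%{word}%'"]
    if last_24h:
        # assumes an ingest timestamp column (seconds since epoch)
        where.append("ts >= unix_timestamp() - 86400")
    return (
        "SELECT stream, count(*) AS hits FROM messages "
        f"WHERE stream IN ({', '.join(repr(s) for s in streams)}) "
        f"AND {' AND '.join(where)} GROUP BY stream"
    )
```

With this, `build_query('foobar', stream='twitter', last_24h=True)` covers the `findwordintwitter24h` case, and the `all`/no-window variants fall out of the same code path.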

The idea was to use Flume, HBase and Knox. But is it that simple? Flume would put the data into HBase, and I could get my information through REST. Knox would secure the incoming and outgoing connections. But I think I'm missing a lot and it isn't as simple as I describe.

Maybe I need a pipeline like Kafka for each use case, or one HBase instance per stream. I am overwhelmed by the large number of tools, and I hope that someone can give me a hint about which tools I need. A little architecture overview with an explanation would be great, so I have something to build on.

thanks, Jan


Re: Improvement proposal for my architecture

Hi @Jan Horton, you're pretty close, I think, in terms of overall structure for your use cases.

I think I would swap out the Flume piece for NiFi for the data movement/ingest, as there is already a pre-built Twitter processor (GetTwitter) for ingesting data from Twitter. Assuming your chat source is something that logs to a file which can be tailed on a server, you can use the file-tailing processor (TailFile) to ingest that data source, and so on.

Take a look here for the other various NiFi processors, shown as a list on the left-hand side:

Once you have all the data streaming in, you can use NiFi to write it into HDFS.

Once your data is in HDFS, I'd probably convert the raw files into ORC, partitioning by date.

Then, finally, I probably wouldn't take on the complexity and overhead of HBase; instead I'd just use Hive to query that information, stored in ORC-backed tables, for the time window you're interested in.
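A sketch of the Hive side described above — an ORC-backed table partitioned by date, plus a query restricted to the last day. The HiveQL is shown as Python string constants; the table and column names are assumptions for illustration:

```python
# Date-partitioned ORC table: filtering on the dt partition column means
# Hive only reads the files for the relevant days (partition pruning)
# instead of scanning the whole dataset.
CREATE_ORC_TABLE = """
CREATE TABLE messages_orc (
  stream  STRING,
  ts      BIGINT,
  message STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
"""

# 24h-window query: the dt predicate prunes to at most two partitions,
# the ts predicate trims to the exact window inside them.
QUERY_LAST_24H = """
SELECT stream, count(*) AS hits FROM messages_orc
WHERE dt >= date_sub(current_date, 1)
  AND ts >= unix_timestamp() - 86400
  AND message LIKE '%foobar%'
GROUP BY stream
"""
```

The date partition is what makes the "last 24h" use cases cheap: without it, every query would scan the full history.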

Hope that helps, you were already pretty close so just a few refinements and you should be good to go!

Good luck!

Re: Improvement proposal for my architecture


Hey @Dave Russell, thank you very much for your help.

I was considering Flume and Kafka and thought Flume would be the better tool for the job. I also read about NiFi, but thought it was for another use case. Twitter was only one example; I will pull data from different sources and have data pushed into Hadoop by messaging systems, and so on. The sources could also include Google Analytics. If NiFi is still the tool for that, I will look at it in more detail. Thanks for the hint!

ORC is new to me; I need to read more about it. Until now, I was thinking about Avro.

Will HDFS be fast enough? I want to know how often the word "foobar" is used in stream1 and in all streams, or how often the words "foo" and "bar" appear in one message of one stream or of all streams, and so on, in real time. By real time I mean that the REST API should give me the answer within milliseconds.

Regarding the REST API: do I need any special tool for it? Is WebHDFS the way to go?

Re: Improvement proposal for my architecture

Fast enough is a question of what sort of response time you're looking for. If you need sub-second response time today, then you're probably right: HBase may be required (and NiFi can feed data into that).
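If HBase is the store, the response time for the 24h use cases comes down to row-key design. A common pattern is a composite key of stream, word, and a reversed timestamp, so a time window becomes a short prefix scan. A hypothetical sketch — the key layout and names are assumptions, not a prescribed schema:

```python
# Hypothetical HBase row-key sketch: with a stream|word|reversed-timestamp
# key, all hits for one word in one stream sort together, newest first,
# so a 24h window is a short prefix scan rather than a full-table read.
LONG_MAX = 2**63 - 1

def row_key(stream, word, ts_millis):
    # Reversed, zero-padded timestamp puts the most recent rows first
    # under HBase's lexicographic byte ordering.
    return f"{stream}|{word}|{LONG_MAX - ts_millis:019d}"

def scan_prefix(stream, word):
    # Prefix for a Scan covering every row of this stream/word pair;
    # the scan can stop as soon as it passes the 24h boundary.
    return f"{stream}|{word}|"
```

The design choice here is that the key encodes the query: lookups never need a filter over the whole table, which is what makes millisecond responses plausible.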

If you're happy with responses in a few seconds, then Hive is probably good enough, and much simpler.

Querying via the REST API is very simple with Hive, and is documented here:
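For a concrete picture of that REST path: Hive statements can be submitted to WebHCat (Templeton) as a POST to `/templeton/v1/hive`. A minimal sketch that only builds the request rather than sending it — the host name and HDFS status directory are assumptions:

```python
# Sketch of a WebHCat (Templeton) Hive submission. WebHCat runs the
# statement asynchronously and writes results to the given HDFS
# status directory. Host and paths here are assumptions.
from urllib.parse import urlencode

WEBHCAT_URL = "http://hadoop-gateway:50111/templeton/v1/hive"  # assumed host

def hive_submit_body(query, user="jan"):
    # Form-encoded body for POST {WEBHCAT_URL}
    return urlencode({
        "user.name": user,
        "execute": query,
        "statusdir": "/tmp/hive-results",
    })

body = hive_submit_body(
    "SELECT count(*) FROM messages WHERE message LIKE '%foobar%'"
)
```

Note the asynchronous model: the caller polls the job and reads the result from the status directory, which is another reason this path suits second-scale rather than millisecond-scale queries.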

Hope that helps!

Re: Improvement proposal for my architecture


I need a sub-second response time, so I will go with HBase. Until now: data will be pushed/pulled with NiFi and stored in HDFS. In HBase I can convert it into ORC files and access them with Hive. So I have the speed of HBase and simple access with Hive over WebHCat. Am I right?

I need to read more about ORC and Avro, because the difference is not really clear to me. I also read that Kafka would be a good tool to use. What do you think?

Re: Improvement proposal for my architecture


Solr going against HDFS could give you that kind of performance. That might be an option to look at for your word-count/frequency-count use case.
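To make the Solr option concrete: a count-only query can ask for `rows=0` and read `numFound`, so Solr returns just the hit count without fetching documents. A sketch using standard Solr query parameters — the collection name and field names (`text`, `stream`, `ts`) are assumptions about how the data would be indexed:

```python
# Sketch of a Solr count query for the word-frequency use case.
# rows=0 means Solr returns only numFound, i.e. the hit count.
# Collection and field names are assumptions.
from urllib.parse import urlencode

def solr_count_url(word, stream=None, last_24h=False,
                   base="http://solr-host:8983/solr/messages/select"):
    q = f'text:"{word}"'
    fq = []
    if stream:
        fq.append(f"stream:{stream}")
    if last_24h:
        # standard Solr date math on an indexed timestamp field
        fq.append("ts:[NOW-24HOURS TO NOW]")
    params = [("q", q), ("rows", "0"), ("wt", "json")]
    params += [("fq", f) for f in fq]
    return base + "?" + urlencode(params)
```

The filter queries (`fq`) are cached independently by Solr, so the per-stream and 24h variants reuse each other's work — a good fit for the many near-identical use cases listed at the start of the thread.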