
Falcon for Teeing

Rising Star

In a teeing-based solution, where data is ingested simultaneously into two clusters, can Falcon be used, similar to a Flume multi-sink setup?

Alternatively, is this better done with HDF than with Falcon? What are the benefits?

1 ACCEPTED SOLUTION

Master Guru

It's a bit like comparing apples to oranges. Falcon is used to pipe huge amounts of data between Hadoop clusters (using DistCp and other tools), and, like Oozie, it can schedule transformation tasks that are supposed to run inside a cluster.

HDF is a streaming solution, somewhat similar to Flume (HDF fans will hit me for that comparison), for ingesting data into a Hadoop cluster (and doing other things with it).

So the question is: do you have data streams (logs, IoT data, social media data, ...) coming in from outside a Hadoop cluster? HDF is perfect for that, and you can easily add two outputs going to different clusters.

Or do you have a source cluster and want to move data to two target clusters while also doing some in-cluster computation like ETL? Then use Falcon/Oozie with DistCp. That doesn't mean you couldn't use HDF for that as well, but it would not be as natural.
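For the DistCp route, a minimal sketch using the DistCp Java API might look like the following. This assumes the Hadoop 2.x DistCpOptions constructor (Hadoop 3 replaced it with a builder), and all cluster URIs, paths, and the TeeCopy class name are hypothetical placeholders:

// Minimal sketch: tee one source directory to two target clusters with the
// DistCp Java API. Hadoop 2.x DistCpOptions constructor assumed; all URIs
// and paths are hypothetical.
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class TeeCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path source = new Path("hdfs://source-nn:8020/data/raw/2016-06-01");

        // One DistCp job per downstream cluster; Oozie or Falcon would
        // typically schedule this as part of the regular ETL workflow.
        String[] targets = {
            "hdfs://target-a-nn:8020/data/raw/2016-06-01",
            "hdfs://target-b-nn:8020/data/raw/2016-06-01"
        };
        for (String target : targets) {
            DistCpOptions options = new DistCpOptions(
                Collections.singletonList(source), new Path(target));
            options.setSyncFolder(true); // update the target if it already exists
            new DistCp(conf, options).execute(); // runs the copy as a MapReduce job
        }
    }
}

Each execute() call launches a separate copy job, so the two targets are written independently; scheduling something like this from Oozie or Falcon gives you the recurring "tee".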


REPLIES


Rising Star

Thanks @Benjamin Leonhardi. I am looking for a solution where a source cluster feeds into two downstream clusters. In another question on HCC, HDF was mentioned as a good fit, hence I wanted to understand its merits in comparison with Falcon.

Master Guru

Yeah, I would say something like Oozie/DistCp might be your better bet here; it fits nicely into the ETL flow you would have in your cluster anyway. HDF is very powerful and in many areas much nicer to use than Oozie/Falcon.

However, if you have a Hadoop cluster, you normally want to do bulk processing in it, and that processing would be scheduled by Oozie/Falcon, so using those frameworks to propagate results or raw files to other clusters also makes sense to me. I would see HDF more as the tool that gathers all incoming information and brings it into the cluster.
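If you go that route, a quick consistency check across the two downstream clusters after the copies finish could look like this sketch, using the standard Hadoop FileSystem API; again, the TeeCheck class name, URIs, and paths are hypothetical placeholders:

// Minimal sketch: verify that both downstream clusters received the same
// number of bytes as the source. Byte counts are only a coarse check;
// DistCp itself can also compare checksums during the copy.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TeeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String dir = "/data/raw/2016-06-01";
        long sourceLen = FileSystem.get(URI.create("hdfs://source-nn:8020"), conf)
                .getContentSummary(new Path(dir)).getLength();
        for (String nn : new String[]{"hdfs://target-a-nn:8020",
                                      "hdfs://target-b-nn:8020"}) {
            long targetLen = FileSystem.get(URI.create(nn), conf)
                    .getContentSummary(new Path(dir)).getLength();
            System.out.println(nn + " matches source: " + (targetLen == sourceLen));
        }
    }
}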

avatar
Contributor

Hi @Greenhorn Techie,

Just curious: how did you implement it in the end?