Created 06-01-2016 10:50 AM
In a teeing-based solution where data is ingested simultaneously into two clusters, can Falcon be used, similar to a Flume multi-sink setup?
Alternatively, is this better done with HDF rather than Falcon? What are the respective benefits?
Created 06-01-2016 10:59 AM
It's a bit like comparing apples to oranges. Falcon is used to pipe large amounts of data between Hadoop clusters (using DistCp and other tools), and, like Oozie, it can schedule transformation tasks that run inside a cluster.
HDF is a streaming solution, a bit more similar to Flume (HDF fans will hit me for that comparison), for ingesting data into a Hadoop cluster (and doing other things with it).
So the question is: do you have data streams (logs, IoT data, social media data, ...) coming in from outside a Hadoop cluster? HDF is perfect for that, and you can easily add two outputs going to different clusters.
Or do you have a source cluster and want to move data to two target clusters, plus do some in-cluster computation like ETL? Then Falcon/Oozie with DistCp is the natural fit. That doesn't mean you couldn't use HDF for this as well, it would just be less natural. A sketch of what a teeing Falcon feed could look like follows below.
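Purely as a hedged illustration: in Falcon, replication to multiple clusters is expressed by declaring one source and several target clusters in a feed entity, and Falcon runs DistCp under the hood for each target. A minimal sketch might look like the following; the cluster entity names (sourceCluster, targetClusterA, targetClusterB), the paths, and the retention limits are placeholders, and the sketch assumes the three cluster entities have already been submitted to Falcon.

```xml
<!-- Hypothetical feed: one source cluster teeing into two targets -->
<feed name="rawLogs" description="Hourly raw log replication" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>

    <clusters>
        <!-- Source cluster: where the data lands first -->
        <cluster name="sourceCluster" type="source">
            <validity start="2016-06-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
        </cluster>
        <!-- First downstream cluster -->
        <cluster name="targetClusterA" type="target">
            <validity start="2016-06-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="months(1)" action="delete"/>
        </cluster>
        <!-- Second downstream cluster -->
        <cluster name="targetClusterB" type="target">
            <validity start="2016-06-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="months(1)" action="delete"/>
        </cluster>
    </clusters>

    <!-- Partitioned HDFS location the feed instances map to -->
    <locations>
        <location type="data" path="/data/raw/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>

    <ACL owner="falcon" group="hadoop" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```

With a definition along these lines, Falcon schedules one replication per target cluster as each hourly instance of the feed becomes available, so both downstream clusters receive the same data without any extra orchestration.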
Created 06-01-2016 12:46 PM
Thanks @Benjamin Leonhardi. I am looking for a solution where a source cluster feeds into two downstream clusters. In another question on HCC, HDF was mentioned as a good fit, hence I wanted to understand its merits compared with Falcon.
Created 06-01-2016 04:08 PM
Yeah, I would say something like Oozie/DistCp might be your better bet here. It fits nicely into the ETL flow you would have in your cluster anyway. HDF is very powerful and in many areas much nicer to use than Oozie/Falcon. However, if you have a Hadoop cluster you normally want to do bulk processing in it, and that processing would be scheduled by Oozie/Falcon, so using those frameworks to propagate results or raw files to the other clusters also makes sense to me. I would see HDF more as the tool that gathers all the incoming information and brings it into the cluster in the first place.
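To make that concrete, here is a rough, hypothetical sketch of an Oozie workflow that uses the DistCp action to push the same partition to two target clusters. The NameNode hosts, the paths, and the ${partitionDate} parameter are made-up placeholders, not taken from any real setup.

```xml
<!-- Hypothetical workflow: tee one HDFS partition to two target clusters -->
<workflow-app name="tee-replication-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="copy-to-target-a"/>

    <!-- Copy the partition to the first target cluster -->
    <action name="copy-to-target-a">
        <distcp xmlns="uri:oozie:distcp-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>-update</arg>
            <arg>${nameNode}/data/raw/logs/${partitionDate}</arg>
            <arg>hdfs://target-a-nn:8020/data/raw/logs/${partitionDate}</arg>
        </distcp>
        <ok to="copy-to-target-b"/>
        <error to="fail"/>
    </action>

    <!-- Then copy the same partition to the second target cluster -->
    <action name="copy-to-target-b">
        <distcp xmlns="uri:oozie:distcp-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>-update</arg>
            <arg>${nameNode}/data/raw/logs/${partitionDate}</arg>
            <arg>hdfs://target-b-nn:8020/data/raw/logs/${partitionDate}</arg>
        </distcp>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>DistCp failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

A coordinator would then trigger this workflow once per partition. Chaining the two copies keeps the sketch simple; they could also be run in parallel with an Oozie fork/join if both clusters should receive the data at the same time.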
Created 05-09-2017 08:14 AM
Just curious to know: how did you implement it in the end?