I was going through data ingestion techniques. Is HDF a one-stop shop for all ingestion? Apart from NiFi, what are the components of HDF (Flume, Sqoop, Kafka, ...)? Does HDF handle all of the following types of data?
To correct something in your question, Flume and Sqoop are part of HDP, not HDF.
HDF allows you to manage your data in motion. It is primarily a data flow tool. However, it is capable of performing simple event processing. Think of HDF as a way to manage and route data between different systems. Those systems can be two different HDP clusters, or the systems can be an HDF and HDP cluster, or a REST API endpoint and an HDP cluster, etc.
With over 200 processors, HDF is flexible and allows you to manage data flows between any number of systems. HDF can pull data from web servers, RESTful APIs, and data warehouses (RDBMS). It can handle streaming data, or bulk-pull data at rest from an HDFS filesystem.
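To make the pull-and-route idea concrete, here is a minimal Python sketch of the pattern an HDF/NiFi flow automates: pull records from a source, then route each one to a sink chosen by a record attribute (roughly what a chain of processors such as InvokeHTTP, RouteOnAttribute, and PutHDFS/PublishKafka does). The source, sinks, and record shape here are hypothetical stand-ins, not NiFi APIs.

```python
# Hypothetical illustration of a data-flow pipeline: source -> route -> sink.
# In HDF the equivalent is configured visually with processors, not coded.

def pull_records(source):
    """Stand-in for a source processor (e.g. polling a REST endpoint)."""
    yield from source

def route(record):
    """Stand-in for attribute-based routing: pick a sink name by record type."""
    return "hdfs" if record.get("type") == "bulk" else "kafka"

def run_flow(source, sinks):
    """Move each record from the source into the sink chosen by route()."""
    for record in pull_records(source):
        sinks[route(record)].append(record)

source = [
    {"type": "bulk", "payload": "historical rows"},
    {"type": "event", "payload": "click"},
]
sinks = {"hdfs": [], "kafka": []}
run_flow(source, sinks)
```

The point is only the shape of the flow: HDF sits between systems, deciding per record where data should land, without the producing or consuming systems knowing about each other.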
Correct me wherever I am wrong: so HDF just pulls data from absolutely anywhere, and Sqoop, Kafka(?), or Flume is required on the HDP side to ultimately put the data into HDFS?
Sqoop is generally used to bulk-pull data from an RDBMS, and then you use HDF to pull the incremental changes. It depends on the volume of data and the use case. In some cases, you can just use HDF; in others, you might use Sqoop first, then HDF for the updates. Anything Flume can do, HDF can do better, so if you are considering Flume, HDF can handle it. HDF has the ability to land data directly into Kafka, HDFS, HBase, Hive, etc. So again, it depends on your use case and the volume of data in terms of what the flow would look like, but HDF can land the data into HDP after pulling it from another source (DB, Cassandra, HTTP, Elasticsearch, Solr, etc.).
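The bulk-then-incremental pattern above can be sketched in a few lines of Python. This is a hypothetical illustration, not Sqoop or NiFi code: the in-memory `table` stands in for an RDBMS, the `landed` list for HDFS, and the watermark for the incremental check column a real flow would track (e.g. an auto-increment id or timestamp).

```python
# Step 1 (the "Sqoop" step): one-time bulk load of everything in the table.
# Step 2 (the "HDF" step): repeated pulls of only rows past the last watermark.

def bulk_pull(table):
    """Initial full load: copy every row and record the highest id seen."""
    return list(table), max((row["id"] for row in table), default=0)

def incremental_pull(table, watermark):
    """Fetch only rows newer than the watermark; advance the watermark."""
    new_rows = [row for row in table if row["id"] > watermark]
    new_watermark = max((row["id"] for row in new_rows), default=watermark)
    return new_rows, new_watermark

table = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
landed, watermark = bulk_pull(table)      # bulk load picks up rows 1 and 2

table.append({"id": 3, "val": "c"})       # a new row arrives in the source DB
delta, watermark = incremental_pull(table, watermark)
landed.extend(delta)                      # only the new row is pulled
```

Whether you need the bulk step at all depends, as noted above, on data volume: for small tables HDF alone can do both the initial and the incremental pulls.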