
HDF can handle all kinds of data?

New Contributor

I was going through the data ingestion techniques. Is HDF a one-stop shop for all ingestion? Apart from NiFi, what are the components of HDF (Flume, Sqoop, Kafka, ...)? Does HDF handle all of the following types of data?

  • Data at rest
  • Data in motion
  • Streaming data
  • Data from a web server
  • Data from a data warehouse

@Rahul Raj

To correct something in your question, Flume and Sqoop are part of HDP, not HDF.

HDF allows you to manage your data in motion. It is primarily a data flow tool. However, it is capable of performing simple event processing. Think of HDF as a way to manage and route data between different systems. Those systems can be two different HDP clusters, or the systems can be an HDF and HDP cluster, or a REST API endpoint and an HDP cluster, etc.
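The routing idea above can be sketched in a few lines of plain Python. This is only an illustration of the concept behind attribute-based routing (similar in spirit to NiFi's RouteOnAttribute processor); the `FlowFile` class, `route` function, and relationship names here are made up for the example and are not actual NiFi APIs.

```python
# Conceptual sketch of attribute-based routing between systems.
# FlowFile, route(), and the relationship names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    content: bytes
    attributes: dict = field(default_factory=dict)

def route(flowfile: FlowFile) -> str:
    """Pick a downstream relationship based on a 'source' attribute."""
    source = flowfile.attributes.get("source", "")
    if source == "rest-api":
        return "to-hdp-cluster-a"
    if source == "hdf-cluster":
        return "to-hdp-cluster-b"
    return "unmatched"

ff = FlowFile(b'{"event": "click"}', {"source": "rest-api"})
print(route(ff))  # → to-hdp-cluster-a
```

In a real flow the data itself never changes hands through code like this; the flow manager just decides which queue (HDP cluster, REST endpoint, etc.) each piece of data is sent down.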

With over 200 processors, HDF is flexible and allows you to manage data flows between any number of systems. HDF can pull data from web servers, RESTful APIs, and data warehouses (RDBMSs). It can handle streaming data, or bulk-pull data at rest from an HDFS filesystem.

New Contributor

Correct me wherever I am wrong. So HDF just pulls data from absolutely anywhere, and Sqoop/Kafka(?)/Flume is required on the HDP side to ultimately put the data into HDFS?

@Rahul Raj

HDF is very flexible. It comes bundled with a large number of processors (over 200). Those processors allow you to pull and push data, or to do some transformations, etc. Some processors are flexible in that they allow you to interact with web services via standard HTTP requests. On top of that, you can create scripts (Python, JavaScript, etc.) and run those scripts from within HDF. Beyond that, you can write your own processors to extend HDF.
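Stripped of the flow-file plumbing that HDF's scripting processors provide, the script body is usually just a record transformation. Here is a hedged, standalone sketch of the kind of logic such a script might contain; the field names are purely illustrative, and this is not the actual scripting API.

```python
import json

def transform(record_json: str) -> str:
    """Example transformation a scripted processor might apply:
    parse a JSON record, normalize a field, add a derived field.
    The 'level' and 'processed' field names are made up for this sketch."""
    record = json.loads(record_json)
    record["level"] = record.get("level", "info").upper()
    record["processed"] = True
    return json.dumps(record)

print(transform('{"level": "warn", "msg": "disk nearly full"}'))
```

In HDF the same function body would run against each flow file's content as it passes through the processor.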

Sqoop is generally used to bulk-pull data from an RDBMS, and then you use HDF to pull the incremental changes. It depends on the volume of data and the use case: in some cases you can just use HDF; in others, you might use Sqoop for the initial load, then HDF for the updates. Anything Flume can do, HDF can do better, so if you are considering Flume, HDF can handle it. HDF has the ability to land data directly into Kafka, HDFS, HBase, Hive, etc. So again, what the flow looks like depends on your use case and data volume, but HDF can land data into HDP after pulling it from another source (a database, Cassandra, HTTP, Elasticsearch, Solr, etc.).
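The bulk-then-incremental pattern described above usually hinges on a watermark column such as a timestamp or an auto-increment id: the bulk load covers everything up to some value, and each incremental pull fetches only rows beyond the last value seen. Here is a minimal sketch of the incremental side, using an in-memory SQLite table as a stand-in for the source RDBMS; the `orders` table and column names are hypothetical. (In HDF itself, NiFi's QueryDatabaseTable processor tracks this maximum-value column for you.)

```python
import sqlite3

# Stand-in for the source RDBMS; table/columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (id, item) VALUES (?, ?)",
                 [(1, "widget"), (2, "gadget"), (3, "gizmo")])

def pull_incremental(conn, last_seen_id):
    """Fetch only rows newer than the watermark, then advance it."""
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,)).fetchall()
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

# Suppose the initial bulk load (the "Sqoop" step) already covered
# ids 1-2; the incremental pull picks up only what came after.
rows, watermark = pull_incremental(conn, last_seen_id=2)
print(rows, watermark)
```

Running the pull again with the advanced watermark returns nothing new, which is exactly the behavior you want from a scheduled incremental flow.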