
Best tools to ingest data into Hadoop

Hi,

We are looking for the best tools to ingest the following types of data into Hadoop:

RDBMS:

Log files:

MQ Messages/events:

Flat Text Files:

Thanks

Sree

1 ACCEPTED SOLUTION

@Sree Venkata

A one-stop tool is NiFi/HDF:

http://hortonworks.com/webinar/introducing-hortonworks-dataflow/

for example:

Q: Does HDF address delta load from an Oracle database to HDFS?

A: HDF, powered by Apache NiFi, does support interaction with databases, though that support is currently narrow in scope. The SQL processor set available today does not yet offer a complete change data capture (CDC) solution. At a framework level, this use case is readily supportable, and we expect to see increasing priority on providing a high-quality user experience around database-oriented change data capture as we move forward.
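Until that CDC support matures, a common interim approach (not mentioned in the answer above; the host, credentials, table, and column names here are placeholders) is Sqoop's incremental import, which pulls only rows changed since the last run:

```shell
# Hypothetical Sqoop incremental import as a stand-in for full CDC.
# Connection string, credentials, table, and columns are placeholders.
sqoop job --create orders_delta -- import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username etl_user --password-file /user/etl/.oracle.pw \
  --table ORDERS \
  --target-dir /data/orders \
  --incremental lastmodified \
  --check-column LAST_UPDATED \
  --merge-key ORDER_ID

# Each run pulls only rows whose LAST_UPDATED is newer than the saved
# checkpoint, then merges them into the existing HDFS data set.
sqoop job --exec orders_delta
```

This covers periodic delta loads; it is not true log-based CDC, since deletes and intermediate updates between runs are not captured.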

Q: How different is this from Flume, Kafka, or other data ingestion frameworks?

A: Kafka is a messaging system. Messaging systems are generally focused on providing mailbox-like semantics whereby the ‘provider’ of data is decoupled from the ‘consumer’ of that data, at least on a physical-connectivity level. In enterprise dataflows, however, there are many other forms of decoupling to consider that are also critical. Protocol, format, schema, priority, and interest are all examples of important ‘separations of concern’ to consider. HDF powered by Apache NiFi is designed to address all of these forms of decoupling. In so doing, NiFi is often used with a system like Kafka, which is aimed at addressing one of those forms of decoupling but does so in a manner that can lead to very high performance under specific usage patterns.

Kafka doesn’t address the user experience and real-time command-and-control aspects of the data lineage capabilities offered by HDF powered by Apache NiFi. The type of security that can be offered by messaging-based systems is largely limited to transport security, encryption of data at rest, and white-list-style authorization to topics. HDF offers similar approaches as well, but since it actually operates on and with the data, it can also perform fine-grained security checks and rule-based contextual authorization. In the end, these systems are designed to tackle different parts of the dataflow problem and are often used together as a more powerful whole.

The comparison of HDF (say, a flow using the “GetFile” processor along with a “PutHDFS” processor) to Flume is a more direct comparison, in that they were designed to address very similar use cases. HDF offers data provenance as well as a powerful and intuitive user experience, with a drag-and-drop UI for interactive command and control. From the management and data-tracking perspectives, HDF and Flume offer quite different feature sets. That said, Flume has been used considerably for some time now and, as is true with any system, the goal of HDF is to integrate with it in the best manner possible. As a result, HDF powered by Apache NiFi supports running Flume sources and sinks right in the flow itself. You can now wire in many Flume sources and sinks in a way that combines Flume’s configuration-file approach with NiFi’s UI-driven approach, offering a best-of-both-worlds solution.




Thanks for your response. I'm looking for answers for a kerberized cluster.

The current NiFi version does not support Kafka in a kerberized cluster. Any thoughts on other tools?

RDBMS: Can NiFi use native direct (Oracle) connectors for RDBMS, the way Sqoop does?

We are specifically looking at large data loads.

Thanks,

sree

RDBMS: Sqoop

Log files: NiFi/Flume, or manual load (hadoop put plus scripts)

MQ Messages/events: NiFi, Storm, Spark Streaming, ...

Flat Text Files: Oozie or cron jobs with scripts, NiFi/Flume
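As a minimal sketch of the "cron jobs with scripts" option for flat text files (the directory paths and schedule here are hypothetical, not from this thread):

```shell
#!/bin/sh
# Nightly flat-file load into HDFS, driven by cron, e.g.:
#   0 2 * * * /opt/scripts/load_flat_files.sh
SRC=/data/incoming
DEST=/landing/$(date +%Y-%m-%d)

# Create a dated landing directory, push the files, and clear the
# source only if the put succeeded.
hdfs dfs -mkdir -p "$DEST"
hdfs dfs -put "$SRC"/*.txt "$DEST"/ && rm "$SRC"/*.txt
```

This approach is simple but offers no delivery guarantees or lineage; NiFi/Flume are the better choice once reliability matters.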

Hi @Sree Venkata.

To add to Neeraj's already excellent answer, and to follow up on your comment: NiFi now *does* support kerberised clusters.

Also, there is now an RDBMS connector, although I'd still say: use Sqoop if you're transferring very large chunks of RDBMS data and want the work parallelised across the whole Hadoop cluster; use NiFi if you've got smaller chunks to transfer that can be parallelised over a smaller NiFi cluster.
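For example, a bulk Sqoop transfer is parallelised across the cluster by raising the mapper count (the connection details, table, and split column below are placeholders, not from this thread):

```shell
# Hypothetical bulk import: 16 parallel map tasks, each importing a
# range of the table determined by splitting on the TXN_ID column.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username etl_user --password-file /user/etl/.oracle.pw \
  --table TRANSACTIONS \
  --split-by TXN_ID \
  --num-mappers 16 \
  --target-dir /data/transactions
```

Sqoop's mappers run as YARN tasks across the whole cluster, which is what makes it a better fit for very large loads than a NiFi flow running on a handful of NiFi nodes.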

Hope that (in combination with Neeraj's answer) fulfills your requirements.