Support Questions


Best tools to import data from a myriad of sources

New Contributor

We currently hand-code imports into our relational database from each source system, and it is very cumbersome. Examples of the source systems would be Salesforce, Twitter, another database, flat files, SharePoint, etc.

In our next line of software we would like to use a technology stack that already has a lot of connectors built to move data from a source system into our target system of either Hadoop or MySQL. Ideally, these connectors would be easy to build and even scriptable. We do not want to reinvent the wheel and are looking for good open source tools to quickly import data into our target system from a large variety of sources.

If this were your requirement, which technology stack would you use, and why? Building a generic way to consume data into your system, backed by a lot of community support, seems to be a common theme in many products. Why reinvent the wheel over and over again?

1 ACCEPTED SOLUTION

Guru

@Fred Schwartz

NiFi is ideal for exactly your needs. NiFi is a 100% open source Apache project. It is also packaged in the Hortonworks DataFlow (HDF) platform, where it is bundled with Kafka, Storm, Ambari, and Ranger. HDF is fully multitenant and secure for enterprise use.

NiFi is built to pull data from dozens of data sources, ranging from relational databases to email, Twitter, local files, S3, HTTP, and so on. It has prebuilt connectors to these sources, and flows are developed in an easy-to-configure, drag-and-drop way. You can easily build your own connectors, and since the project is open source, new ones are added continuously.
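
If a connector you need doesn't exist yet, writing one is not much code. Here is a minimal sketch of a custom processor, assuming the standard nifi-api classes from NiFi 1.x (AbstractProcessor, ProcessSession, Relationship); the class name and the content it writes are made up for illustration, not an existing processor.

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

@Tags({"example", "ingest"})
@CapabilityDescription("Sketch of a custom source connector")
public class MyCustomSourceProcessor extends AbstractProcessor {

    // Downstream processors are wired to this relationship on the canvas.
    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Records fetched from the custom source")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Create a FlowFile and fill it with whatever your source returns.
        FlowFile flowFile = session.create();
        flowFile = session.write(flowFile,
                out -> out.write("hello from my source".getBytes(StandardCharsets.UTF_8)));
        session.transfer(flowFile, REL_SUCCESS);
    }
}

Package a class like this as a NAR, drop it into NiFi's lib directory, and it shows up in the same drag-and-drop palette as the built-in processors.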

In addition to pulling from a number of sources, you can push to diverse targets as well: HDFS, Hive, and Kafka are possibilities, as are email, Amazon S3, and many more. Note that HDF works as a great complement to HDP (Hadoop) but does not require it.

In between pulling from sources and pushing to targets, NiFi lets you transform data, route it based on its contents, merge it, and perform other mediations.

You can get an idea of the data sources you can pull from, the mediations you can apply to that data, and the targets you can push to by looking at this list of processors (processors are the basic units you connect into a data flow): https://nifi.apache.org/docs.html
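
On the "scriptable" point: everything the NiFi UI does goes through NiFi's REST API, so you can inspect and drive flows programmatically. Below is a rough sketch that lists the processor types an instance supports, assuming a default unsecured NiFi on port 8080 and the /flow/processor-types path from the NiFi 1.x REST API docs; adjust the URL and add authentication for a secured install.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ListNifiProcessorTypes {
    public static void main(String[] args) throws Exception {
        // Default unsecured NiFi address; change host/port for your install.
        URL url = new URL("http://localhost:8080/nifi-api/flow/processor-types");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            // The response is JSON describing every processor type this instance knows about.
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}

The same API covers creating process groups, starting and stopping processors, and deploying templates, which should cover the scripting side of your requirement.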

Again, one of the great things about NiFi is its easy-to-use UI/configuration approach (screenshot below).

HCC has numerous articles on NiFi. Just do a search.

Check out:

http://hortonworks.com/apache/nifi/

http://hortonworks.com/blog/hortonworks-dataflow-2-0-ga/

https://nifi.apache.org/docs.html

https://www.youtube.com/watch?v=jctMMHTdTQI

You can download and start using it here: http://hortonworks.com/downloads/#dataflow

[Screenshot: NiFi flow configuration UI]


2 REPLIES


New Contributor

This looks really promising, Greg. Thank you, I will check it out.