
How to pull data from an API and store it in HDFS

Expert Contributor

I am aware of Flume and Kafka, but those are event-driven tools. I don't need the import to be event-driven or real-time; I just want to schedule it once a day.

What data ingestion tools are available for importing data from APIs into HDFS?

I am not using HBase either, only Hive. I have used the `R` language for this for quite a while, but I am looking for a more robust solution, ideally one native to the Hadoop environment.

1 ACCEPTED SOLUTION

Guru

To move data from a relational database into HDFS or Hive, Sqoop is your best tool: http://hortonworks.com/apache/sqoop/. You can schedule it through Oozie: http://hortonworks.com/apache/oozie/.
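For instance, a daily import could be wrapped in a small script and handed to a scheduler. This is only a minimal sketch: it assumes a MySQL source, and the JDBC URL, credentials path, and table names are hypothetical placeholders, not anything from the original post.

```python
import subprocess

# Hypothetical connection details -- replace with your own source database.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # placeholder JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pw",   # keeps the password off the command line
    "--table", "orders",                        # placeholder source table
    "--hive-import",                            # load straight into Hive
    "--hive-table", "default.orders",
]

# Run the import; schedule this script once a day via Oozie or cron.
subprocess.run(sqoop_cmd, check=True)
```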

For diverse sources such as logs, emails, RSS feeds, etc., NiFi is your best bet: http://hortonworks.com/apache/nifi/. It includes RESTful API capabilities via easy-to-configure HTTP processors, and it has its own scheduler. HCC has many articles on NiFi.

You could also do a RESTful pull with wget (or a similar client) from a Linux server and push the result to HDFS.
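A rough sketch of that pull-and-push approach in Python, assuming the `requests` and `hdfs` (HdfsCLI) packages are installed and using hypothetical API and NameNode addresses:

```python
import requests
from hdfs import InsecureClient  # HdfsCLI: pip install hdfs

# Hypothetical endpoints -- substitute your API and your NameNode's WebHDFS address.
API_URL = "https://api.example.com/v1/report"
client = InsecureClient("http://namenode.example.com:50070", user="etl")

# Pull the data once; schedule this script daily with cron or an Oozie shell action.
resp = requests.get(API_URL, timeout=60)
resp.raise_for_status()

# Write the payload to HDFS, replacing the previous day's snapshot.
client.write("/data/api/report.json", data=resp.content, overwrite=True)
```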

You could also use Zeppelin to pull data via wget as above, or to pull streaming data via Spark. Zeppelin lets you visualize the results as well, and it has its own scheduler.
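Once a snapshot is on HDFS, a short Spark paragraph in Zeppelin can load and query it. The sketch below uses the Spark 2 SparkSession API; the path and view name are hypothetical:

```python
from pyspark.sql import SparkSession

# Zeppelin's Spark interpreter normally provides a session; this line covers standalone runs.
spark = SparkSession.builder.appName("api-report").getOrCreate()

# Load the JSON snapshot written by the daily ingest (hypothetical path).
df = spark.read.json("hdfs:///data/api/report.json")

# Register a temp view so a Zeppelin %sql paragraph can chart the results.
df.createOrReplaceTempView("api_report")
spark.sql("SELECT COUNT(*) AS row_count FROM api_report").show()
```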

Sqoop, Oozie, and Zeppelin come out of the box with the HDP platform.

NiFi is part of the HDF platform and integrates easily with HDFS.

It is not difficult to set up a Linux box to communicate with HDFS.


2 REPLIES


Master Guru

NiFi/HDF is the way to go: it is very easy to use and supports a huge number of sources.

https://community.hortonworks.com/articles/52415/processing-social-media-feeds-in-stream-with-apach....

https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html

https://community.hortonworks.com/articles/46258/iot-example-in-apache-nifi-consuming-and-producing.html

https://community.hortonworks.com/articles/45531/using-apache-nifi-070s-new-putslack-processor.html

https://community.hortonworks.com/articles/45706/using-the-new-hiveql-processors-in-apache-nifi-070.html

https://community.hortonworks.com/content/kbentry/44018/create-kafka-topic-and-use-from-apache-nifi-...

https://community.hortonworks.com/content/kbentry/55839/reading-sensor-data-from-remote-sensors-on-r...