Support Questions

How to pull data from API and store it in HDFS

Expert Contributor

I am aware of Flume and Kafka, but these are event-driven tools. I don't need event-driven or real-time ingestion; I just want to schedule the import once a day.

What data ingestion tools are available for importing data from APIs into HDFS?

I am not using HBase either, only Hive. I have used `R` for this for quite some time, but I am looking for a more robust, perhaps Hadoop-native, solution.

1 ACCEPTED SOLUTION


Guru

From a relational database to HDFS or Hive, Sqoop is your best tool (http://hortonworks.com/apache/sqoop/). You can schedule it through Oozie (http://hortonworks.com/apache/oozie/).
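As a rough sketch of what a scheduled Sqoop-to-Hive import could look like (the connection string, credentials, table, and Hive target below are all placeholders, not from the original post):

```shell
# Hypothetical daily import: pull the "orders" table from a MySQL
# database into a Hive table. -P prompts for the password at runtime.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --hive-import \
  --hive-table default.orders \
  --num-mappers 4
```

An Oozie coordinator (or even cron) can then run this command once a day.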

For diverse sources like logs, emails, RSS feeds, etc., NiFi is your best bet (http://hortonworks.com/apache/nifi/). It includes RESTful API capabilities via easy-to-configure HTTP processors, and it has its own scheduler. HCC has many articles on NiFi.

You could also do a RESTful pull with wget from a Linux server and push the result to HDFS.
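A minimal sketch of that wget-then-push approach, assuming a hypothetical API endpoint and HDFS target directory (adjust both to your environment):

```shell
# Hypothetical endpoint and paths -- not from the original post.
API_URL="https://api.example.com/v1/records"
LOCAL_FILE="/tmp/api_data_$(date +%Y%m%d).json"
HDFS_DIR="/data/api_imports"

wget -q -O "$LOCAL_FILE" "$API_URL"           # pull the API response to a date-stamped file
hdfs dfs -mkdir -p "$HDFS_DIR"                # make sure the HDFS target exists
hdfs dfs -put -f "$LOCAL_FILE" "$HDFS_DIR/"   # push to HDFS, overwriting any same-day file
```

For the once-a-day requirement, a cron entry such as `0 2 * * * /path/to/pull_api.sh` would run this every night at 2 a.m.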

You could also use Zeppelin to pull via wget as above, or to pull streaming data via Spark. Zeppelin lets you visualize the results as well, and it has its own scheduler.

Sqoop, Oozie, and Zeppelin come out of the box with the HDP platform.

NiFi is part of the HDF platform and integrates easily with HDFS.

It is not difficult to set up a Linux box to communicate with HDFS.


Super Guru

NiFi/HDF is the way to go: very easy to use, with a huge number of supported sources.

https://community.hortonworks.com/articles/52415/processing-social-media-feeds-in-stream-with-apach....

https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html

https://community.hortonworks.com/articles/46258/iot-example-in-apache-nifi-consuming-and-producing.html

https://community.hortonworks.com/articles/45531/using-apache-nifi-070s-new-putslack-processor.html

https://community.hortonworks.com/articles/45706/using-the-new-hiveql-processors-in-apache-nifi-070.html

https://community.hortonworks.com/content/kbentry/44018/create-kafka-topic-and-use-from-apache-nifi-...

https://community.hortonworks.com/content/kbentry/55839/reading-sensor-data-from-remote-sensors-on-r...