How to pull data from API and store it in HDFS
Labels: Apache Hadoop
Created ‎09-15-2016 12:29 PM
I am aware of Flume and Kafka, but those are event-driven tools. I don't need event-driven or real-time ingestion; scheduling the import once a day would be enough.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase, only Hive. I have used `R` for this for quite a while, but I am looking for a more robust, perhaps Hadoop-native, solution.
Created ‎09-15-2016 03:54 PM
From a relational database to HDFS or Hive, Sqoop is your best tool: http://hortonworks.com/apache/sqoop/ You can schedule it through Oozie: http://hortonworks.com/apache/oozie/
For diverse sources like logs, emails, RSS feeds, etc., NiFi is your best bet: http://hortonworks.com/apache/nifi/ It includes REST API capabilities via easy-to-configure HTTP processors, and it has its own scheduler. HCC has many articles on NiFi.
You could also do a RESTful pull with wget (or a small script) from a Linux server and push the result to HDFS; a minimal sketch of that approach is below.
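If you go the script-from-a-Linux-box route, here is a minimal sketch of what a daily pull could look like. It fetches from a placeholder API URL with Python's `requests` and lands the file through the WebHDFS REST API; the hostnames, port, paths, and API endpoint are assumptions, and WebHDFS must be enabled on the namenode. You would schedule the script with cron or an Oozie shell action.

```python
# Minimal sketch: pull JSON from a REST API and land it in HDFS over WebHDFS.
# API URL, namenode host/port, user, and paths are placeholders (assumptions).
import datetime
import requests

API_URL = "https://api.example.com/v1/orders"      # hypothetical source API
NAMENODE = "http://namenode.example.com:50070"     # WebHDFS endpoint (Hadoop 2.x default port)
HDFS_USER = "etl"

def main():
    # 1. Pull the data once; a scheduler (cron/Oozie) runs this daily.
    payload = requests.get(API_URL, timeout=60)
    payload.raise_for_status()

    # 2. Build a dated target path so each day's pull lands in its own file.
    target = "/data/raw/orders/orders_{}.json".format(datetime.date.today().isoformat())

    # 3. WebHDFS CREATE is a two-step call: the namenode answers with a 307
    #    redirect to a datanode, and the file body is PUT to that location.
    create_url = "{}/webhdfs/v1{}?op=CREATE&overwrite=true&user.name={}".format(
        NAMENODE, target, HDFS_USER)
    step1 = requests.put(create_url, allow_redirects=False)
    step1.raise_for_status()
    datanode_url = step1.headers["Location"]

    step2 = requests.put(datanode_url, data=payload.content)
    step2.raise_for_status()
    print("Wrote {} bytes to {}".format(len(payload.content), target))

if __name__ == "__main__":
    main()
```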
You could also use Zeppelin to pull via wget as above, or to pull streaming data via Spark. Zeppelin lets you visualize the data as well, and it has its own scheduler (a rough Zeppelin-paragraph sketch follows the links below).
- https://zeppelin.apache.org/docs/0.5.5-incubating/tutorial/tutorial.html
- http://hortonworks.com/apache/zeppelin/
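For the Zeppelin option, here is a rough sketch of a single `%pyspark` paragraph that does the same pull and stores the result as Parquet so Hive or Spark SQL can query it later. The API URL and HDFS path are placeholders, and `sc`/`sqlContext` are the objects Zeppelin's Spark interpreter injects; Zeppelin's built-in cron scheduler can re-run the note daily.

```python
# Rough sketch of a Zeppelin %pyspark paragraph (placeholders throughout).
import requests

api_url = "https://api.example.com/v1/orders"   # hypothetical source API
raw = requests.get(api_url, timeout=60).text    # one JSON document per daily run

# Parse the response into a DataFrame; sc and sqlContext are provided by Zeppelin.
df = sqlContext.read.json(sc.parallelize([raw]))

# Append today's pull to a Parquet directory in HDFS that a Hive table can sit on.
df.write.mode("append").parquet("hdfs:///data/raw/orders_parquet")
```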
Sqoop, Oozie, and Zeppelin come out of the box with the HDP platform.
NiFi is part of the HDF platform and integrates easily with HDFS.
It is not difficult to set up a Linux box to communicate with HDFS.
Created ‎09-15-2016 06:00 PM
NiFi/HDF is the way to go: very easy, and it supports a huge number of sources.
- https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html
- https://community.hortonworks.com/articles/46258/iot-example-in-apache-nifi-consuming-and-producing.html
- https://community.hortonworks.com/articles/45531/using-apache-nifi-070s-new-putslack-processor.html
- https://community.hortonworks.com/articles/45706/using-the-new-hiveql-processors-in-apache-nifi-070.html
