Created 07-08-2017 06:30 PM
Created 07-09-2017 06:56 PM
Flume does not have website scraping capabilities. One might guess that HttpSource can be used for tasks like this, but HttpSource is just an HTTP server running inside Flume. You can push data to it, but it cannot pull data from a website.
Check the IMDB site: you can download the data from Amazon S3, but you have to pay the data transfer fee. See http://www.imdb.com/interfaces
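To illustrate the push model of HttpSource described above, here is a minimal Python sketch of a client that sends events to it. It assumes the source uses Flume's default JSON handler, which accepts a JSON array of events, each with a "headers" map and a "body" string. The function names and the URL in the usage note are made up for the example; use whatever host and port your agent is configured with.

```python
import json
import urllib.request


def build_flume_payload(lines, headers=None):
    """Encode text lines as a JSON array of events ("headers" + "body").

    This is the shape expected by Flume's default JSON handler for
    HttpSource; a custom handler may expect something else.
    """
    headers = headers or {}
    events = [{"headers": dict(headers), "body": line} for line in lines]
    return json.dumps(events).encode("utf-8")


def push_to_flume(url, lines, headers=None, timeout=10):
    """POST the events to a running HttpSource and return the HTTP status.

    The client initiates the request; HttpSource only listens.
    """
    request = urllib.request.Request(
        url,
        data=build_flume_payload(lines, headers),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `push_to_flume("http://localhost:44444", ["first event"], {"origin": "demo"})` would send one event, assuming an agent is listening on that (hypothetical) address. Whatever fetches the data from the website still has to be written separately; Flume only receives what the client sends.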
Created 07-13-2017 06:55 AM
Thank you for your response.
I see that Flume does not have website scraping capabilities. But we can easily pull real-time tweets by creating an application on Twitter. Is the same thing possible for other websites, such as Blogger or Quora?
@mhegedus
Created 07-13-2017 09:20 AM
It is possible if the website publishes its streaming data via a public API and you implement a custom Flume source to ingest it. In the case of Twitter there is an API for that, but you have to pay to use it. For Quora or Blogger, I am not sure whether such an API exists. Another option is to write code that reads RSS feeds and writes the entries to disk or HDFS, but you do not need Flume for that.
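The RSS option mentioned above could look roughly like the following Python sketch, which uses only the standard library. It assumes an RSS 2.0 feed whose entries are `<item>` elements with `title`, `link`, and `pubDate` children (Atom feeds are structured differently and would need other element names). The function names and the one-JSON-object-per-line output format are choices made for the example, not anything prescribed by Flume or Hadoop.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=10):
    """Download the raw feed bytes from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_rss_items(xml_data):
    """Extract title, link and pubDate from every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_data)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "pubDate": (item.findtext("pubDate") or "").strip(),
        })
    return items


def append_items(items, path):
    """Append each entry to a local file as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as out:
        for entry in items:
            out.write(json.dumps(entry) + "\n")
```

A scheduled job could call `append_items(parse_rss_items(fetch_feed(feed_url)), "feed.jsonl")` and then copy the file to HDFS, for instance with `hdfs dfs -put`. The sketch does not deduplicate entries between runs, so a real job would also need to remember which links it has already written.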