Support Questions

dhruvprakash34 · ‎07-08-2017

mhegedus · ‎07-09-2017

Flume does not have website scraping capabilities. One might guess that HttpSource can be used for tasks like this, but HttpSource is just an HTTP server running inside Flume: clients push data to it, and it cannot pull data from a website.
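To illustrate the push direction described above, here is a minimal Python sketch of a client posting events to an HttpSource. It assumes the source uses Flume's default JSON handler, which accepts a JSON array of events with "headers" and "body" fields. The URL, the port and the "source" header name are placeholders chosen for illustration, not values from this thread.

```python
import json
import urllib.request


def build_flume_payload(lines, source_name="example"):
    """Wrap text lines in the JSON array-of-events format (headers + body)."""
    events = [{"headers": {"source": source_name}, "body": line} for line in lines]
    return json.dumps(events).encode("utf-8")


def push_to_flume(lines, url="http://localhost:44444"):
    """POST events to a running HttpSource.

    The URL is an assumption: replace it with the bind address and port
    configured for the HTTP source in your own agent.
    """
    request = urllib.request.Request(
        url,
        data=build_flume_payload(lines),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Any scraping or downloading would still have to happen in the client, before push_to_flume is called. Flume only receives what is sent to it.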

Check the IMDB site: you can download the data from Amazon S3, but you have to pay the data transfer fee. See http://www.imdb.com/interfaces

dhruvprakash34 · ‎07-13-2017

Thank you for your response.

I see that Flume does not have website scraping capabilities. But we can easily pull real-time tweets by creating an application on Twitter. Is the same thing possible for any other website, like Blogger, Quora or anything else?

@mhegedus

mhegedus · ‎07-13-2017

It is possible if the website publishes its streaming data via a public API and you implement a custom Flume source to ingest it. In the case of Twitter there is an API for that, but you have to pay to use it. In the case of Quora or Blogger, I am not sure whether such an API exists. Another option could be to write code that reads RSS feeds and writes the entries to disk or HDFS, but you do not need Flume to do this.
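As a rough sketch of the RSS option mentioned above, the following Python code reads an RSS 2.0 feed using only the standard library and appends each entry to a local file as one JSON object per line. The function names and the output layout are illustrative assumptions, and the feed URL has to be supplied by the caller, since whether a given site offers a feed is not confirmed here.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def parse_rss(xml_data):
    """Extract title, link and publication date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_data)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


def fetch_feed(url):
    """Download the raw feed bytes; the URL is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def append_items(items, path):
    """Append entries to a local file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as out:
        for entry in items:
            out.write(json.dumps(entry) + "\n")
```

A scheduled job could call append_items(parse_rss(fetch_feed(feed_url)), "feed.jsonl") and then copy the file to HDFS, for example with hdfs dfs -put. This sketch does not deduplicate entries between runs.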

Cloudera Community

I want to fetch IMDB data using Flume. I am skeptical about configuring Flume for it. Please help.