
I want to fetch IMDB data using Flume. I am unsure how to configure Flume for it. Please help.

New Contributor
 
1 ACCEPTED SOLUTION

Contributor

Flume does not have website-scraping capabilities. One might guess that HttpSource could be used for tasks like this, but HttpSource is just an HTTP server running inside Flume. You can push data to it, but it cannot pull data from a website.

Check the IMDB site: you can download the data from Amazon S3, but you have to pay the data transfer fee. See http://www.imdb.com/interfaces
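
To illustrate the push direction mentioned above, here is a minimal Python sketch of a client sending events to a Flume HttpSource. It assumes the source uses Flume's default JSON handler, which accepts a JSON array of events with string-valued `headers` and a `body`. The URL `http://localhost:44444` and the function names are hypothetical and must be adapted to your own agent configuration.

```python
import json
import urllib.request


def build_flume_payload(records, headers=None):
    """Wrap records as a JSON array of events with "headers" and "body".

    Header values should be strings; the same headers are attached to
    every event in this sketch.
    """
    headers = headers or {}
    events = [{"headers": dict(headers), "body": str(r)} for r in records]
    return json.dumps(events).encode("utf-8")


def push_to_flume(records, url="http://localhost:44444"):
    """POST the events to a running HttpSource (the URL is an assumption)."""
    request = urllib.request.Request(
        url,
        data=build_flume_payload(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Whatever program obtains the data (for example, one that reads the downloaded IMDB files) would call `push_to_flume` itself; Flume only receives what is sent to it.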


3 REPLIES

New Contributor

Thank you for your response.

I see that Flume does not have website-scraping capabilities. But we can easily pull real-time tweets by creating an application on Twitter. Is the same thing possible for other websites, such as Blogger or Quora?

@mhegedus

Contributor

It is possible if the website publishes its streaming data via a public API and you implement a custom Flume source to ingest it. For Twitter there is such an API, but you have to pay to use it. For Quora or Blogger, I am not sure whether one exists. Another option is to write code that reads RSS feeds and writes the entries to disk or HDFS, but you do not need Flume for that.
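
As a rough sketch of the RSS option, the Python below parses an RSS 2.0 feed using only the standard library and appends each item to a local file as one JSON line. The function names, the chosen fields, and the output format are assumptions for illustration. Atom feeds use different element names and are not handled here.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return title, link and pubDate for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items


def save_feed(feed_url, out_path):
    """Download a feed and append its items to out_path as JSON lines."""
    with urllib.request.urlopen(feed_url, timeout=10) as response:
        items = parse_rss(response.read())
    with open(out_path, "a", encoding="utf-8") as out:
        for entry in items:
            out.write(json.dumps(entry) + "\n")
    return len(items)
```

The resulting file can then be copied into HDFS with `hdfs dfs -put`, or the script can be run on a schedule to pick up new entries.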