Support Questions


Using Flume to get webpage data: how to configure it and use it to stream data

Expert Contributor

Hi,

 

I have a 3-node cluster using the latest Cloudera parcels for version 5.9. The OS is CentOS 6.7 on all three nodes.

 

I am using Flume for the first time. I have just used the 'add service' option in the Cloudera GUI to add Flume.

My goal is to get data from a webpage into HDFS/HBase.

 

Can you please help me with how to do this? What else do I need to make streaming data from a webpage possible?

 

Also, I have seen an example on the net for Twitter, where we need to create a token on the Twitter page to get the data. However, the webpage I am referring to is a third-party one, and I am not sure how to configure Flume to get its data onto my cluster. I guess it would be over HTTP.

 

Please help me to get this done.

 

Thanks in advance.

Shilpa

 

 

1 ACCEPTED SOLUTION

Expert Contributor

@pdvorak thanks!

 

Yes, I wrote Java code to pull the RSS feed and used an Exec source and Avro sink on two nodes, with an Avro source as collector and an HDFS sink on the third node.

 

View solution in original post

18 REPLIES

Expert Contributor

@pdvorak / @hshreedharan

 

I ran curl against the IP and saw that it uses port 80 to connect to the news webpage. Telnet also works on port 80.

 

[root@LnxMasterNode01 /]# telnet 132.247.1.32 80
Trying 132.247.1.32...
Connected to 132.247.1.32.
Escape character is '^]'.
^CConnection closed by foreign host.

 

However, when restarting Flume I get the same error as earlier. Can this be related only to the absence of plugins.d (see my previous post)?

 

2016-12-19 16:45:00,353 WARN org.mortbay.log: failed SelectChannelConnector@132.247.1.32:80: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 WARN org.mortbay.log: failed Server@36772002: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 ERROR org.apache.flume.source.http.HTTPSource: Error while starting HTTPSource. Exception follows.
java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:207)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-12-19 16:45:00,364 ERROR org.apache.flume.lifecycle.LifecycleSupervisor: Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{
name:http-source,state:IDLE} } - Exception follows.
java.lang.RuntimeException: java.net.BindException: Cannot assign requested address 
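
This BindException means the address Flume tried to bind is not local to the machine: HTTPSource starts an embedded web server, so its `bind` property must be an IP or hostname owned by the Flume host, not the remote news site's IP. A hedged sketch of the relevant lines (the port is just an illustrative free local port):

```properties
# HTTPSource listens locally; bind must be a local address, not the remote site's IP.
tier1.sources.http-source.type = http
tier1.sources.http-source.bind = 0.0.0.0
tier1.sources.http-source.port = 5440
```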

 

Please help me resolve this issue.

 

Thanks,

Shilpa

Expert Contributor

I have edited my flume.conf to:

 

# Please paste flume.conf here.

# Sources, channels, and sinks are defined per
# agent name, in this case 'tier1'.
tier1.sources = http-source
tier1.channels = mem-channel-1
tier1.sinks = hdfs-sink
# For each source, channel, and sink, set
# standard properties.
tier1.sources.http-source.type = http
tier1.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
tier1.sources.http-source.bind = localhost
tier1.sources.http-source.url = http://www.jornada.unam.mx/ultimas
tier1.sources.http-source.port = 5440
tier1.sources.http-source.channels = mem-channel-1
tier1.channels.mem-channel-1.type = memory
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.channel = mem-channel-1
tier1.sinks.hdfs-sink.hdfs.path = hdfs://lnxmasternode01.centralus.cloudapp.azure.com/flume/events/%y-%m-%d/%H%M/%S
# Other properties are specific to each type of
# source, channel, or sink. In this case, we
# specify the capacity of the memory channel.
tier1.channels.mem-channel-1.capacity = 100

 

Now I can see http-source shown as started in the Flume logs.

 

However, no data is being streamed to the HDFS path I specified in the config. Can anyone suggest what to do next?

 

-bash-4.1$ hadoop fs -ls /flume
Found 1 items
drwxr-xr-x - flume hdfs 0 2016-12-23 11:49 /flume/events
-bash-4.1$ hadoop fs -ls /flume/events
-bash-4.1$
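
One thing worth noting here: HTTPSource does not fetch the `url` you set (that is not a standard HTTPSource property); it only runs an embedded server that accepts events POSTed to it, which is consistent with the reply below that something else must pull from the remote site. A quick way to sanity-check the rest of the pipeline is to hand-build the JSON payload that JSONHandler expects (a JSON array of events, each with `headers` and `body`) and POST it to the source. A sketch, assuming the agent listens on localhost:5440 as configured above:

```python
import json

# JSONHandler expects a JSON array of events, each with "headers" and "body".
# The header key/value and body text below are arbitrary test data.
events = [
    {"headers": {"source": "manual-test"}, "body": "hello flume"},
]
payload = json.dumps(events)
print(payload)

# To actually send it (requires the Flume agent to be up and reachable):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:5440",
#     data=payload.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

If a POST like this lands in HDFS but nothing else does, the source is fine and the missing piece is a client that pulls the webpage and posts it to Flume.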

Expert Contributor

I checked the Flume source jars; I can find only these in the Cloudera bundle:

 

[hadoop@LnxMasterNode01 jars]$ ll flume*source*
-rw-r--r-- 1 root root 20586 Oct 21 04:58 flume-avro-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 26893 Oct 21 04:58 flume-jms-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 22843 Oct 21 04:58 flume-kafka-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 61447 Oct 21 04:58 flume-scribe-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 34830 Oct 21 04:58 flume-taildir-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 80709 Oct 21 04:58 flume-thrift-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 14540 Oct 21 04:58 flume-twitter-source-1.6.0-cdh5.9.0.jar

 

Can this be the reason the HTTP source is not working, i.e. no data streaming is happening, even though flume.log shows no error and says http-source started?

 

How do I get the jars for the HTTP source?

 

Thanks,

Shilpa

As I stated before, Flume can't consume from a remote HTTP server. You would need something that can consume from the remote server and then post the data to Flume.

-pd

Expert Contributor

@pdvorak thanks!

 

Yes, I wrote Java code to pull the RSS feed and used an Exec source and Avro sink on two nodes, with an Avro source as collector and an HDFS sink on the third node.
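
For anyone following along, a minimal sketch of that topology. The agent names, port, and path to the RSS-reader jar are illustrative assumptions, and which host runs the collector is a guess based on this thread:

```properties
# Leaf agent (runs on each of the two nodes): Exec source -> Avro sink.
agent1.sources = rss-exec
agent1.channels = mem
agent1.sinks = avro-out
agent1.sources.rss-exec.type = exec
agent1.sources.rss-exec.command = java -jar /opt/rss/rss-reader.jar
agent1.sources.rss-exec.channels = mem
agent1.channels.mem.type = memory
agent1.sinks.avro-out.type = avro
agent1.sinks.avro-out.hostname = lnxmasternode01.centralus.cloudapp.azure.com
agent1.sinks.avro-out.port = 4545
agent1.sinks.avro-out.channel = mem

# Collector agent (third node): Avro source -> HDFS sink.
collector.sources = avro-in
collector.channels = mem
collector.sinks = hdfs-out
collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 0.0.0.0
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = mem
collector.channels.mem.type = memory
collector.sinks.hdfs-out.type = hdfs
collector.sinks.hdfs-out.channel = mem
collector.sinks.hdfs-out.hdfs.path = /flume/events/%y-%m-%d
collector.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
```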

 

New Contributor

Hi Shilpa,

 

Were you able to get webpage data into HDFS via Flume? Please let me know everything you did.

 

Thanks.

Explorer

Hi @ShilpaSinha,

 

can you share how you got that Java code to pull the RSS feed?

 

Regards,

David

Here is an example of creating a simple Java RSS reader and setting Flume up to read its output:

http://www.ibm.com/developerworks/library/bd-flumews/

-pd

Explorer

Hi @pdvorak,

 

thanks a lot for your answer; I've already checked that page and it helped.

 

Thanks again.

DB