Support Questions


Using Flume to get webpage data: how to configure it and use it to stream data

Expert Contributor

Hi,

 

I have a 3-node cluster using the latest Cloudera parcels for version 5.9. The OS is CentOS 6.7 on all three nodes.

 

I am using Flume for the first time. I have just used the 'Add Service' option in the Cloudera Manager GUI to add Flume.

My goal is to get data from a webpage into HDFS/HBase.

 

Can you please help me with how to do this? What else do I need to make streaming data from a webpage possible?

 

Also, I have seen an example on the net for Twitter, where we need to create a token on the Twitter page to get the data. However, the webpage I am referring to is a third-party one, and I am not sure how to configure Flume to get its data onto my cluster. I guess it would be over HTTP.

 

Please help me to get this done.

 

Thanks in advance.

Shilpa

 

 

1 ACCEPTED SOLUTION

18 REPLIES

Expert Contributor

@pdvorak / @hshreedharan

 

I ran curl against the IP and saw that it uses port 80 to connect to the news webpage. Even telnet works on port 80.

 

[root@LnxMasterNode01 /]# telnet 132.247.1.32 80
Trying 132.247.1.32...
Connected to 132.247.1.32.
Escape character is '^]'.
^CConnection closed by foreign host.

 

However, when restarting Flume, I am getting the same error as earlier. Can this be related ONLY to the absence of plugins.d (see the previous post)?

 

2016-12-19 16:45:00,353 WARN org.mortbay.log: failed SelectChannelConnector@132.247.1.32:80: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 WARN org.mortbay.log: failed Server@36772002: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 ERROR org.apache.flume.source.http.HTTPSource: Error while starting HTTPSource. Exception follows.
java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:207)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-12-19 16:45:00,364 ERROR org.apache.flume.lifecycle.LifecycleSupervisor: Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{
name:http-source,state:IDLE} } - Exception follows.
java.lang.RuntimeException: java.net.BindException: Cannot assign requested address 

 

Please help me resolve this issue.

 

Thanks,

Shilpa

Expert Contributor

I have edited my flume.conf to:

 

# Please paste flume.conf here.

# Sources, channels, and sinks are defined per
# agent name, in this case 'tier1'.
tier1.sources = http-source
tier1.channels = mem-channel-1
tier1.sinks = hdfs-sink
# For each source, channel, and sink, set
# standard properties.
tier1.sources.http-source.type = http
tier1.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
tier1.sources.http-source.bind = localhost
tier1.sources.http-source.url = http://www.jornada.unam.mx/ultimas
tier1.sources.http-source.port = 5440
tier1.sources.http-source.channels = mem-channel-1
tier1.channels.mem-channel-1.type = memory
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.channel = mem-channel-1
tier1.sinks.hdfs-sink.hdfs.path = hdfs://lnxmasternode01.centralus.cloudapp.azure.com/flume/events/%y-%m-%d/%H%M/%S
# Other properties are specific to each type of
# source, channel, or sink. In this case, we
# specify the capacity of the memory channel.
tier1.channels.mem-channel-1.capacity = 100

 

Now I can see http-source reported as started in the Flume logs.

 

However, no data is getting streamed to the HDFS path I mentioned in the config. Can anyone suggest what to do now?

 

-bash-4.1$ hadoop fs -ls /flume
Found 1 items
drwxr-xr-x - flume hdfs 0 2016-12-23 11:49 /flume/events
-bash-4.1$ hadoop fs -ls /flume/events
-bash-4.1$
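
For what it's worth, Flume's HTTPSource with JSONHandler does not fetch pages itself; it only accepts events POSTed to the port it binds. A quick way to check whether the source is accepting events is a manual POST. This is only a sketch, assuming the localhost:5440 bind from the flume.conf above:

```python
import json
import urllib.request

# JSONHandler expects a JSON array of events, each with an optional
# "headers" map (string keys and values) and a "body" string.
events = [{"headers": {"source": "manual-test"}, "body": "hello flume"}]
payload = json.dumps(events).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:5440",  # bind host/port from the flume.conf above
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with the agent running
```

If the source is healthy, the POST returns HTTP 200 and the event should shortly appear under the sink's HDFS path.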

Expert Contributor

I checked the Flume source jars; I can find only these in the Cloudera bundle:

 

[hadoop@LnxMasterNode01 jars]$ ll flume*source*
-rw-r--r-- 1 root root 20586 Oct 21 04:58 flume-avro-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 26893 Oct 21 04:58 flume-jms-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 22843 Oct 21 04:58 flume-kafka-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 61447 Oct 21 04:58 flume-scribe-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 34830 Oct 21 04:58 flume-taildir-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 80709 Oct 21 04:58 flume-thrift-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 14540 Oct 21 04:58 flume-twitter-source-1.6.0-cdh5.9.0.jar

 

Can this be the reason the HTTP source is not working, i.e. no data is streaming even though there is no error in flume.log and it says http-source started?

 

How do I get the jars related to http-source?

 

Thanks,

Shilpa

As I stated before, Flume can't consume from a remote HTTP server. You would need something that consumes from the remote server and then posts to Flume.

-pd
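
A minimal sketch of such a bridge, fetching the page and posting it to the agent's HTTP source. The page URL and the localhost:5440 endpoint are taken from earlier in this thread; everything else here is an illustrative assumption, not a definitive implementation:

```python
import json
import urllib.request

# Endpoints from this thread: the news page and the HTTPSource bind/port
# from the posted flume.conf.
PAGE_URL = "http://www.jornada.unam.mx/ultimas"
FLUME_URL = "http://localhost:5440"

def build_events(body, url):
    """Wrap page text in the JSON event list JSONHandler expects."""
    return [{"headers": {"url": url}, "body": body}]

def fetch_page(url):
    """Download the remote page -- the step Flume cannot do itself."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def post_to_flume(body):
    """POST the wrapped page to the Flume HTTP source."""
    req = urllib.request.Request(
        FLUME_URL,
        data=json.dumps(build_events(body, PAGE_URL)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# To run the bridge once (network and a running agent required):
# post_to_flume(fetch_page(PAGE_URL))
```

Scheduling it with cron (or a loop with a sleep) would give the periodic pull-and-push that Flume's HTTP source alone cannot provide.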


New Contributor

Hi Shilpa,

 

Were you able to get webpage data into HDFS via Flume? Please let me know what you did.

 

Thanks.

Explorer

Hi @ShilpaSinha,

 

Can you share how you got the Java code to pull the RSS feed?

 

Regards,

David

Here is an example of creating a simple Java RSS reader and setting Flume up to read its output:

http://www.ibm.com/developerworks/library/bd-flumews/

-pd
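
The article pairs a small RSS reader with a Flume flow. The parsing step can be sketched like this (in Python purely for illustration; the feed snippet and tag names are assumptions, not taken from the article):

```python
import xml.etree.ElementTree as ET

def parse_rss_titles(rss_xml):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

# Illustrative feed snippet; a real reader would download the feed first.
sample = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First story</title></item>
  <item><title>Second story</title></item>
</channel></rss>"""

print(parse_rss_titles(sample))  # ['First story', 'Second story']
```

Each extracted item could then be wrapped as a Flume event and handed to the agent, as the article describes.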

Explorer

Hi @pdvorak,

 

Thanks a lot for your answer. I've already checked that page and it helped.

 

Thanks again.

DB