Created on 11-28-2016 04:20 PM - edited 09-16-2022 03:49 AM
Hi,
I have a 3-node cluster running the latest Cloudera 5.9 parcels. The OS is CentOS 6.7 on all three nodes.
I am using Flume for the first time; I have just used the 'Add Service' option in the Cloudera Manager GUI to add Flume.
My goal is to get data from a webpage into HDFS/HBase.
Can you please help me with how to do this? What else do I need to make streaming data from a webpage possible?
Also, I have seen an example on the net for Twitter, where you need to create a token on the Twitter page to get the data. However, the webpage I am referring to is a third-party one, and I am not sure how to configure Flume to pull its data onto my cluster. I guess it would be over HTTP.
Please help me get this done.
Thanks in Advance.
Shilpa
Created 12-19-2016 02:56 PM
I ran curl against the IP and saw that it uses port 80 to connect to the news webpage. Even telnet works on port 80.
[root@LnxMasterNode01 /]# telnet 132.247.1.32 80
Trying 132.247.1.32...
Connected to 132.247.1.32.
Escape character is '^]'.
^CConnection closed by foreign host.
However, when restarting Flume, I am getting the same error as earlier. Can this be related ONLY to the absence of plugins.d (see the previous post)?
2016-12-19 16:45:00,353 WARN org.mortbay.log: failed SelectChannelConnector@132.247.1.32:80: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 WARN org.mortbay.log: failed Server@36772002: java.net.BindException: Cannot assign requested address
2016-12-19 16:45:00,353 ERROR org.apache.flume.source.http.HTTPSource: Error while starting HTTPSource. Exception follows.
java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:207)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-12-19 16:45:00,364 ERROR org.apache.flume.lifecycle.LifecycleSupervisor: Unable to start EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{name:http-source,state:IDLE} } - Exception follows.
java.lang.RuntimeException: java.net.BindException: Cannot assign requested address
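Or could it be the bind address itself? 132.247.1.32 is the news site's remote IP, and as far as I understand, the HTTP source can only bind its listening socket to an address on the local machine (and port 80 would also need root). A minimal sketch of the source section I would try instead, assuming the agent is named tier1 and an unprivileged port (both placeholders):
# bind must be a local interface (or 0.0.0.0), not the remote site's IP
tier1.sources = http-source
tier1.sources.http-source.type = http
tier1.sources.http-source.bind = 0.0.0.0
tier1.sources.http-source.port = 5440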
Please help me resolve this issue.
Thanks,
Shilpa
Created 12-26-2016 11:31 AM
I have edited my flume.conf to:
# Please paste flume.conf here.
# Sources, channels, and sinks are defined per
# agent name, in this case 'tier1'.
tier1.sources = http-source
tier1.channels = mem-channel-1
tier1.sinks = hdfs-sink
# For each source, channel, and sink, set
# standard properties.
tier1.sources.http-source.type = http
tier1.sources.http-source.handler = org.apache.flume.source.http.JSONHandler
tier1.sources.http-source.bind = localhost
tier1.sources.http-source.url = http://www.jornada.unam.mx/ultimas
tier1.sources.http-source.port = 5440
tier1.sources.http-source.channels = mem-channel-1
tier1.channels.mem-channel-1.type = memory
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.channel = mem-channel-1
tier1.sinks.hdfs-sink.hdfs.path = hdfs://lnxmasternode01.centralus.cloudapp.azure.com/flume/events/%y-%m-%d/%H%M/%S
# Other properties are specific to each type of
# source, channel, or sink. In this case, we
# specify the capacity of the memory channel.
tier1.channels.mem-channel-1.capacity = 100
Now I can see http-source as started in the Flume logs.
However, no data is being streamed to the HDFS path I specified in the config. Can anyone suggest what to do next?
-bash-4.1$ hadoop fs -ls /flume
Found 1 items
drwxr-xr-x - flume hdfs 0 2016-12-23 11:49 /flume/events
-bash-4.1$ hadoop fs -ls /flume/events
-bash-4.1$
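One thing I notice in the Flume documentation: the HTTP source does not fetch anything from a URL. It only listens on bind:port for events POSTed to it, so the url property above is most likely ignored (it is not a standard HTTPSource setting), and nothing will reach HDFS until some process pushes events to the source. A quick way to test the pipeline end to end, assuming the agent runs on this host with the JSON handler configured above:
curl -X POST -H 'Content-Type: application/json' \
     -d '[{"headers": {}, "body": "test event"}]' \
     http://localhost:5440
If a file then appears under /flume/events, the agent itself is fine, and what is missing is something that pulls the webpage and POSTs its content to this port.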
Created 12-26-2016 04:56 PM
I checked the Flume jars for sources; I can find only these in the Cloudera bundle:
[hadoop@LnxMasterNode01 jars]$ ll flume*source*
-rw-r--r-- 1 root root 20586 Oct 21 04:58 flume-avro-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 26893 Oct 21 04:58 flume-jms-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 22843 Oct 21 04:58 flume-kafka-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 61447 Oct 21 04:58 flume-scribe-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 34830 Oct 21 04:58 flume-taildir-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 80709 Oct 21 04:58 flume-thrift-source-1.6.0-cdh5.9.0.jar
-rw-r--r-- 1 root root 14540 Oct 21 04:58 flume-twitter-source-1.6.0-cdh5.9.0.jar
Can this be the reason why the HTTP source is not working, i.e. no data streaming happens even though flume.log shows no error and says http-source started?
How do I get the jars related to http-source?
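Update: from what I can tell, HTTPSource is not shipped as a separate flume-*-source jar at all; it lives inside the core Flume jar, so it may not be missing. A quick check, assuming the CDH naming of the core jar:
[hadoop@LnxMasterNode01 jars]$ jar tf flume-ng-core-1.6.0-cdh5.9.0.jar | grep -i source/http
If org/apache/flume/source/http/HTTPSource.class is listed, the source is present and the empty HDFS directory is not a missing-jar problem.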
Thanks,
Shilpa
Created on 01-06-2017 04:03 PM - edited 01-06-2017 04:05 PM
@pdvorak thanks!
Yes, I wrote Java code to pull the RSS feed, and used an Exec source and an Avro sink on 2 nodes, with an Avro source as the collector and an HDFS sink on the 3rd node.
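In case it helps anyone later, the layout was roughly as follows. This is only a sketch: the agent names, the Avro port (4545), and the RSS-puller command are placeholders, not my exact values.
# tier on the two leaf nodes: run the RSS puller, forward events over Avro
agent1.sources = rss-exec
agent1.channels = mem-1
agent1.sinks = avro-fwd
agent1.sources.rss-exec.type = exec
# hypothetical command; anything that prints one event per line to stdout works
agent1.sources.rss-exec.command = java -jar /opt/rsspuller/rsspuller.jar
agent1.sources.rss-exec.channels = mem-1
agent1.channels.mem-1.type = memory
agent1.sinks.avro-fwd.type = avro
agent1.sinks.avro-fwd.hostname = lnxmasternode01.centralus.cloudapp.azure.com
agent1.sinks.avro-fwd.port = 4545
agent1.sinks.avro-fwd.channel = mem-1
# collector on the 3rd node: receive Avro events, write to HDFS
collector.sources = avro-in
collector.channels = mem-1
collector.sinks = hdfs-sink
collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 0.0.0.0
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = mem-1
collector.channels.mem-1.type = memory
collector.sinks.hdfs-sink.type = hdfs
collector.sinks.hdfs-sink.channel = mem-1
collector.sinks.hdfs-sink.hdfs.path = hdfs://lnxmasternode01.centralus.cloudapp.azure.com/flume/events/%y-%m-%d
# exec events carry no timestamp header, so let the sink use local time for the path escapes
collector.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true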
Created 02-06-2017 03:21 PM
Hi Shilpa,
Were you able to get webpage data into HDFS via Flume? Please let me know everything you did.
Thanks.
Created 06-20-2017 09:58 AM
Hi @pdvorak,
thanks a lot for your answer; I've already checked that page and it helped.
Thanks again.
DB