Created 04-21-2016 11:20 AM
Hello,
I am pretty new to Storm and want to get started by processing some tweet streams with it. What are the basic steps to set this up?
I am aware that there is a streaming API for this (https://dev.twitter.com/streaming/overview), but how would I integrate it with my Storm components to get it working?
Any insights appreciated.
Thanks!
Created 04-21-2016 11:27 AM
Hi Wellington,
There are a few examples out there of using the Twitter API with Storm. You should have a look at the Hortonworks tutorials (like this one: http://hortonworks.com/hadoop-tutorial/ingesting-processing-real-time-events-apache-storm/), which provide comprehensive, well-explained examples using the sandbox.
If you only want simple code examples, you will find plenty by searching for "github twitter storm" on the internet. One possible example is: https://github.com/pvillard31/storm-twitter
There are many possibilities, depending on how you want to use the Twitter API. For example, you will find some examples that leverage the Trident API in Storm.
Hope that helps!
Created 04-23-2016 02:45 AM
Thanks Pierre. I am starting with this tutorial on my 3-node cluster (I am not using the sandbox) to get familiar with Storm: http://hortonworks.com/hadoop-tutorial/ingesting-processing-real-time-events-apache-storm
I followed all the steps up to the point where I have to run:
[root@sandbox ~]# storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
Here I am getting some errors. Should I run this command from a specific folder?
Here is a partial description of the error:
at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:271) [storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:157) [storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at storm.starter.WordCountTopology.main(WordCountTopology.java:77) [storm-starter-0.0.1-storm-0.9.0.1.jar:?]
Caused by: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.thrift7.transport.TSocket.open(TSocket.java:187) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at backtype.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:102) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at backtype.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:48) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
... 8 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_60]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_60]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_60]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_60]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_60]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_60]
at org.apache.thrift7.transport.TSocket.open(TSocket.java:182) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at backtype.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:102) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at backtype.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:48) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
... 8 more
Exception in thread "main" java.lang.RuntimeException: Could not find leader nimbus from seed hosts [ip-172-31-34-25.sa-east-1.compute.internal]. Did you specify a valid list of nimbus hosts for config nimbus.seeds
at backtype.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:90)
at backtype.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:225)
at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:271)
at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:157)
at storm.starter.WordCountTopology.main(WordCountTopology.java:77)
Thanks-
Wellington
Created 04-23-2016 10:52 AM
Hi Wellington,
It sounds like Storm is not in good shape. Is everything green for Storm in Ambari?
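If Ambari does show everything green, another thing worth checking is whether the Nimbus Thrift port is reachable from the machine where you submit the topology, since the ConnectException in the trace usually means nothing is listening at the host and port the client tried. A minimal sketch, assuming the default nimbus.thrift.port of 6627 and that bash and the coreutils timeout command are installed; port_open is a hypothetical helper name, not part of Storm:

```shell
# Hypothetical helper: exits 0 if a TCP connection to host:port can be opened.
port_open() {
  host="$1"
  port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP socket; timeout caps the wait at 3 seconds.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (replace the placeholder with your own Nimbus host name):
#   port_open your-nimbus-host 6627 && echo "reachable" || echo "not reachable"
```

If the port is not reachable, the host passed via nimbus.host (or listed in nimbus.seeds) probably does not match the machine where Nimbus actually runs.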
Created 04-24-2016 04:05 PM
Thanks Pierre. Storm was in good shape, with everything green. I changed the nimbus.host value to my own host and it worked. 🙂 Now that I have been able to run my first topology on Storm, I am looking into ways to adapt it to Twitter streams. Thanks for the link to https://github.com/pvillard31/storm-twitter. I will use it for my initial test: I am going to clone your project repo and start working on it from my instance. I also have some questions about connecting to the Twitter API. Would you have documentation explaining these two processes step by step?
Merci!
Created 04-24-2016 04:24 PM
Regarding the Twitter API, I used Twitter4j (doc and examples here: http://twitter4j.org/en/code-examples.html). You will need to create a Twitter App from your Twitter account to generate the access tokens required by the API (https://apps.twitter.com/). Regarding the topology itself, have a look at the README on GitHub, which explains the idea (in short: it only gets tweets containing specific keywords, aggregates the number of tweets for a given keyword in a given time period using ticks, and stores this information in both HDFS and Hive).
Note: if one of the answers in this thread answers your initial question, could you mark it as accepted? It helps other users when they are looking for information 😉 Thanks.
Created 04-25-2016 02:02 AM
Thanks Pierre. Your info is very useful. I have cloned the project to my instance and I am about to build the jars. Which classes do you use to build your storm-twitter-0.0.1-SNAPSHOT.jar?
Created 04-25-2016 07:42 AM
To build the jar, you must use Maven and run "mvn clean package", which creates an uber jar (with all dependencies included). You will then be able to run your topology using:
storm jar storm-twitter-0.0.1-SNAPSHOT.jar fr.pvillard.storm.topology.Topology <local|cluster> <consumer key> <consumer key secret> <access token key> <access token secret>
where Topology is the class in which I defined the topology and which contains the main method.
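Because the command takes five positional arguments, it is easy to pass them in the wrong order or to forget one. The sketch below is one way to check the arguments before submitting; storm_twitter_cmd is a hypothetical wrapper name, and it only prints the command so that it can be reviewed first:

```shell
# Hypothetical wrapper: validates the arguments and prints the storm command
# instead of running it.
storm_twitter_cmd() {
  mode="$1"
  case "$mode" in
    local|cluster) ;;
    *) echo "mode must be 'local' or 'cluster'" >&2; return 1 ;;
  esac
  # Expect the mode plus the four Twitter credentials.
  if [ "$#" -ne 5 ]; then
    echo "usage: storm_twitter_cmd <local|cluster> <consumer key> <consumer key secret> <access token key> <access token secret>" >&2
    return 1
  fi
  shift
  echo "storm jar storm-twitter-0.0.1-SNAPSHOT.jar fr.pvillard.storm.topology.Topology $mode $*"
}
```

Once the printed command looks right, it can be copied and run from the directory that contains the jar.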
Created 04-26-2016 03:16 AM
Thanks Pierre. Sorry to keep coming back to the same point, but this is the first time I have used Maven. I get the errors below when I try to run clean package. I am not sure whether I have to change something in the pom.xml, or where it is located.
[root@ip-172-31-34-25 bin]# ./mvn clean package https://github.com/pvillard31/storm-twitter
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.075 s
[INFO] Finished at: 2016-04-25T23:10:05-04:00
[INFO] Final Memory: 5M/115M
[INFO] ------------------------------------------------------------------------
[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/opt/apache-maven-3.3.9/bin). Please verify you invoked Maven from the correct directory. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException
Created 04-26-2016 07:05 AM
When running a Maven command, the best practice is to run it from the directory containing the pom.xml file (the file that holds all the Maven instructions for packaging the project). In this case, you should run the command from your clone of the GitHub project, where the pom file is located:
cd /path/to/storm-twitter
/opt/apache-maven-3.3.9/bin/mvn clean package
You may also want to add Maven to your PATH so that you can call the mvn command directly.
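As a rough sketch of both points, the snippet below appends the Maven bin directory from this thread to PATH for the current shell session and wraps the build in a small guard that refuses to run when no pom.xml is present. build_project is a hypothetical helper name, and the Maven location is assumed to match the one in the error output above:

```shell
# Maven install location taken from this thread; adjust it if yours differs.
MAVEN_BIN="/opt/apache-maven-3.3.9/bin"

# Append Maven to PATH only if it is not already listed.
case ":$PATH:" in
  *":$MAVEN_BIN:"*) ;;
  *) PATH="$PATH:$MAVEN_BIN"; export PATH ;;
esac

# Hypothetical helper: build only when the directory contains a pom.xml.
build_project() {
  project_dir="$1"
  if [ ! -f "$project_dir/pom.xml" ]; then
    echo "no pom.xml in $project_dir" >&2
    return 1
  fi
  # Run Maven in a subshell so the caller's working directory is unchanged.
  (cd "$project_dir" && mvn clean package)
}
```

It would be called as build_project /path/to/storm-twitter. To keep the PATH change across sessions, the same lines could go into your shell profile (for example ~/.bashrc).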