Tweet streams with Storm

avatar
Expert Contributor

Hello,

I am pretty new to Storm and I am getting started by trying to process some tweet streams with it. What would be the basic steps to get started?

I am aware that there is a streaming API for it (https://dev.twitter.com/streaming/overview), but how would I integrate it with my Storm components to get it working?

Any insights appreciated.

Thanks!

1 ACCEPTED SOLUTION

avatar

Hi Wellington,

There are a few examples out there that use the Twitter API with Storm. You should have a look at the Hortonworks tutorials (like this one: http://hortonworks.com/hadoop-tutorial/ingesting-processing-real-time-events-apache-storm/) for comprehensive, well-explained examples that use the sandbox.

If you only want simple code examples, you will find a lot by searching for "github twitter storm" on the internet. A possible example is: https://github.com/pvillard31/storm-twitter

But there are a lot of possibilities depending on how you want to use the Twitter API. For example, you will find some examples that leverage the Trident API in Storm.

Hope that helps!

avatar
Expert Contributor

Thanks Pierre. I am starting with this tutorial on my 3-node cluster (I am not using the sandbox) to get familiar with Storm: http://hortonworks.com/hadoop-tutorial/ingesting-processing-real-time-events-apache-storm

I followed all the steps up to the point where I have to run:

[root@sandbox ~]# storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com

Here I get some errors. Should I run this command from a specific folder?

Here is a partial description of the error:

at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:271) [storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:157) [storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at storm.starter.WordCountTopology.main(WordCountTopology.java:77) [storm-starter-0.0.1-storm-0.9.0.1.jar:?]

Caused by: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Conexão recusada

at org.apache.thrift7.transport.TSocket.open(TSocket.java:187) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at backtype.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:102) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at backtype.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:48) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

... 8 more

Caused by: java.net.ConnectException: Conexão recusada

at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_60]

at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_60]

at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_60]

at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_60]

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_60]

at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_60]

at org.apache.thrift7.transport.TSocket.open(TSocket.java:182) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at backtype.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:102) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

at backtype.storm.security.auth.TBackoffConnect.doConnectWithRetry(TBackoffConnect.java:48) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]

... 8 more

Exception in thread "main" java.lang.RuntimeException: Could not find leader nimbus from seed hosts [ip-172-31-34-25.sa-east-1.compute.internal]. Did you specify a valid list of nimbus hosts for config nimbus.seeds

at backtype.storm.utils.NimbusClient.getConfiguredClientAs(NimbusClient.java:90)

at backtype.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:225)

at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:271)

at backtype.storm.StormSubmitter.submitTopology(StormSubmitter.java:157)

at storm.starter.WordCountTopology.main(WordCountTopology.java:77)

Thanks-

Wellington

avatar

Hi Wellington,

It sounds like Storm is not in good shape. Is everything green for Storm in Ambari?

avatar
Expert Contributor

Thanks Pierre. Storm was in good shape, with everything green. I changed nimbus.host to my own host and it worked. 🙂 Now that I have been able to run my first topology on Storm, I am looking into ways to adapt it to Twitter streams. Thanks for the link to https://github.com/pvillard31/storm-twitter. I will use it for my initial test: I am going to clone your project repo and start working on it from my instance. I also have some questions about connecting to the Twitter API. Do you have documentation that explains these two processes step by step?

Merci!
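
For anyone hitting the same "Connection refused" error (the "Conexão recusada" in the trace above): it means nothing accepted the TCP connection at the Nimbus host and port the client tried, which fits the fix of pointing nimbus.host at the right machine. Below is a minimal sketch of a reachability check, assuming Nimbus listens on port 6627 (Storm's default for nimbus.thrift.port; your cluster may use another value). The class NimbusPortCheck and its helper are made up for illustration and are not part of Storm.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of Storm): checks whether a TCP port accepts connections.
class NimbusPortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused", timeouts, and host names that do not resolve.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: 6627 is the Nimbus Thrift port; pass another port if your cluster differs.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6627;
        System.out.println(host + ":" + port + " reachable: " + isReachable(host, port, 2000));
    }
}
```

Run it from the machine where you submit the topology, passing the Nimbus host name as the first argument. If it reports false, check the host name and port in your Storm configuration before looking at the topology itself.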

avatar

Regarding the Twitter API, I used Twitter4J (docs and examples here: http://twitter4j.org/en/code-examples.html). You will need to create a Twitter App from your Twitter account to generate the access tokens required by the API (https://apps.twitter.com/). Regarding the topology itself, you can have a look at the README on GitHub, which explains the idea (in short: it only gets tweets containing specific keywords, aggregates the number of tweets for a given keyword in a given time period using ticks, and stores this information in both HDFS and Hive).
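
To make the tick idea concrete, here is a minimal plain-Java sketch of the counting logic described above: count tweets per keyword, then emit and reset the counts on every tick. It is not the code from the repository, and KeywordWindowCounter, onTweet and onTick are hypothetical names. In a real topology this logic would sit inside a bolt that receives tick tuples (Storm can send them when topology.tick.tuple.freq.secs is set for the component).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: per-keyword tweet counts for one time window, reset on each tick.
class KeywordWindowCounter {
    private final List<String> keywords;
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    KeywordWindowCounter(List<String> keywords) {
        this.keywords = keywords;
    }

    // Called for every tweet; counts the tweet once for each keyword it contains.
    void onTweet(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                counts.merge(keyword, 1, Integer::sum);
            }
        }
    }

    // Called on every tick; returns the counts of the window that just ended and starts a new one.
    Map<String, Integer> onTick() {
        Map<String, Integer> snapshot = new LinkedHashMap<>(counts);
        counts.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        KeywordWindowCounter counter = new KeywordWindowCounter(Arrays.asList("storm", "hadoop"));
        counter.onTweet("Learning Apache Storm today");
        counter.onTweet("Storm and Hadoop on one cluster");
        counter.onTweet("Nothing relevant here");
        System.out.println(counter.onTick()); // prints {storm=2, hadoop=1}
        System.out.println(counter.onTick()); // prints {} because no tweets arrived since the last tick
    }
}
```

The snapshot returned by onTick is what a bolt would pass downstream, for example to whatever writes to HDFS or Hive.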

Note: if one of the answers in this thread answers your initial question, could you mark it as accepted? It helps other users when they are looking for information 😉 Thanks.

avatar
Expert Contributor

Thanks Pierre. Your info is very useful. I have cloned the project to my instance and I am about to build the JARs. Which classes do you use to build your storm-twitter-0.0.1-SNAPSHOT.jar?

avatar

To build the JAR, use Maven and run "mvn clean package", which creates an uber JAR (with all dependencies included). Then you will be able to run your topology using:

storm jar storm-twitter-0.0.1-SNAPSHOT.jar fr.pvillard.storm.topology.Topology <local|cluster> <consumer key> <consumer key secret> <access token key> <access token secret>

where Topology is the class in which I defined the topology and which contains the main method.
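
As an illustration of the arguments in the command above, here is a hedged sketch of how a main method could validate them before building the topology. TopologyArgs is a hypothetical helper, not a class from the project; the real Topology class may handle its arguments differently.

```java
// Hypothetical helper: validates the five command-line arguments shown in the command above.
class TopologyArgs {
    static final String USAGE =
        "<local|cluster> <consumer key> <consumer key secret> <access token key> <access token secret>";

    final boolean local;            // true for "local", false for "cluster"
    final String consumerKey;
    final String consumerSecret;
    final String accessToken;
    final String accessTokenSecret;

    private TopologyArgs(boolean local, String consumerKey, String consumerSecret,
                         String accessToken, String accessTokenSecret) {
        this.local = local;
        this.consumerKey = consumerKey;
        this.consumerSecret = consumerSecret;
        this.accessToken = accessToken;
        this.accessTokenSecret = accessTokenSecret;
    }

    // Parses the arguments and fails fast with a usage message if they are malformed.
    static TopologyArgs parse(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException("Usage: " + USAGE);
        }
        String mode = args[0];
        if (!mode.equals("local") && !mode.equals("cluster")) {
            throw new IllegalArgumentException("Mode must be 'local' or 'cluster', got: " + mode);
        }
        return new TopologyArgs(mode.equals("local"), args[1], args[2], args[3], args[4]);
    }

    public static void main(String[] args) {
        TopologyArgs parsed = parse(args);
        // Assumption: local mode would run in-process (for example with Storm's LocalCluster),
        // and cluster mode would submit through StormSubmitter; neither is shown here.
        System.out.println("Running in " + (parsed.local ? "local" : "cluster") + " mode");
    }
}
```

Failing early on a wrong argument count avoids submitting a topology that can only fail later when it tries to authenticate against the Twitter API.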

avatar
Expert Contributor

Thanks Pierre. Sorry to keep coming back to the same point, but this is the first time I have used Maven. I am getting the errors below when I try to run clean package. I am not sure whether I have to change something in the pom.xml, or where it is located...

[root@ip-172-31-34-25 bin]# ./mvn clean package https://github.com/pvillard31/storm-twitter

[INFO] Scanning for projects...

[INFO] ------------------------------------------------------------------------

[INFO] BUILD FAILURE

[INFO] ------------------------------------------------------------------------

[INFO] Total time: 0.075 s

[INFO] Finished at: 2016-04-25T23:10:05-04:00

[INFO] Final Memory: 5M/115M

[INFO] ------------------------------------------------------------------------

[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/opt/apache-maven-3.3.9/bin). Please verify you invoked Maven from the correct directory. -> [Help 1]

[ERROR]

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR]

[ERROR] For more information about the errors and possible solutions, please read the following articles:

[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException

avatar

When running a Maven command, the best practice is to run it from the directory containing the pom.xml file (the file that holds all the Maven instructions for packaging the project). Maven does not take a repository URL as an argument, so in this case you should run the command inside your clone of the GitHub project, where the pom.xml file is located.

cd /path/to/storm-twitter

/opt/apache-maven-3.3.9/bin/mvn clean package

You may also want to add Maven to your PATH so that you can call the mvn command directly.