
Tweet streams with Storm

Expert Contributor

Hello,

I am pretty new to Storm and I am getting started by trying to process some tweet streams with it. What would be the basic steps to get started?

I am aware that there is a streaming API for it (https://dev.twitter.com/streaming/overview), but how would I integrate it with my Storm components to make it work?

Any insights appreciated.

Thanks!

1 ACCEPTED SOLUTION


Hi Wellington,

There are a few examples out there of using the Twitter API with Storm. You should have a look at the Hortonworks tutorials (like this one: http://hortonworks.com/hadoop-tutorial/ingesting-processing-real-time-events-apache-storm/) for comprehensive, well-explained examples using the sandbox.

If you only want simple code examples, you will find plenty by searching for "github twitter storm" on the internet. One possible example is: https://github.com/pvillard31/storm-twitter

But there are a lot of possibilities depending on how you want to use the Twitter API. For example, you will find some examples that leverage the Trident API in Storm.
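
To give an idea of the usual pattern: a spout registers a twitter4j listener, Twitter pushes statuses to that listener as they arrive, and the spout drains a buffer in nextTuple(). A minimal sketch, assuming twitter4j and storm-core 0.10 (backtype.storm packages); the class name and the single-field tuple are illustrative, not the code from the repository above:

    import java.util.Map;
    import java.util.concurrent.LinkedBlockingQueue;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;
    import twitter4j.conf.ConfigurationBuilder;

    public class TwitterSampleSpout extends BaseRichSpout {
        private final String consumerKey, consumerSecret, accessToken, accessTokenSecret;
        private transient LinkedBlockingQueue<Status> queue;
        private transient TwitterStream stream;
        private transient SpoutOutputCollector collector;

        public TwitterSampleSpout(String consumerKey, String consumerSecret,
                                  String accessToken, String accessTokenSecret) {
            this.consumerKey = consumerKey;
            this.consumerSecret = consumerSecret;
            this.accessToken = accessToken;
            this.accessTokenSecret = accessTokenSecret;
        }

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.queue = new LinkedBlockingQueue<>(1000);
            ConfigurationBuilder cb = new ConfigurationBuilder()
                    .setOAuthConsumerKey(consumerKey)
                    .setOAuthConsumerSecret(consumerSecret)
                    .setOAuthAccessToken(accessToken)
                    .setOAuthAccessTokenSecret(accessTokenSecret);
            stream = new TwitterStreamFactory(cb.build()).getInstance();
            // Twitter pushes statuses to this listener; tweets are dropped if the buffer is full.
            stream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    queue.offer(status);
                }
            });
            stream.sample(); // start the public sample stream
        }

        @Override
        public void nextTuple() {
            Status status = queue.poll();
            if (status == null) {
                Utils.sleep(50); // nothing buffered yet, avoid busy-looping
            } else {
                collector.emit(new Values(status.getText()));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("tweet"));
        }

        @Override
        public void close() {
            stream.shutdown();
        }
    }

Note that the listener is push-based: the spout never polls Twitter itself, it only drains what has already been buffered.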

Hope that helps!


16 REPLIES

Expert Contributor

Hey Pierre. I performed a successful build; however, I am getting an error saying that the Topology class was not found when I run this:

[root@ip-172-31-34-25 storm-twitter]# storm jar storm-twitter-0.0.1-SNAPSHOT.jar fr.pvillard.storm.topology.Topology host=ec2-52-67-8-253.sa-east-1.compute.amazonaws.com kFI3G29IJ5UOMnbe3qmJpDw5L iZszClk61Lfdu6hTxRAIW1STPX1TtbFXpIKlehxHNUGIpMWYFT 140206682-thGEZ8KIYfYbHY9Rzvzu2CO8ry6UBmSEvUe0zOGZ 8uA5F0T4yhnLv16fgrFP4S6W5ETflmGzLd3dPW1chb46v

Error: Could not find or load main class fr.pvillard.storm.topology.Topology

Should I run the command above from a specific folder?

Thanks-

Wellington

Expert Contributor

Thanks for the tips Pierre. I was able to run the topology with no errors (apparently):

460  [main] INFO b.s.u.Utils - Using defaults.yaml from resources
523  [main] INFO b.s.u.Utils - Using storm.yaml from resources
584  [main] INFO b.s.u.Utils - Using defaults.yaml from resources
602  [main] INFO b.s.u.Utils - Using storm.yaml from resources
608  [main] INFO b.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -8800577250600957523:-7254875077049838623
609  [main] INFO b.s.s.a.AuthUtils - Got AutoCreds []
624  [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
653  [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
654  [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
659  [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
663  [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
668  [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
676  [main] INFO b.s.StormSubmitter - Uploading topology jar storm-twitter-0.0.1-SNAPSHOT.jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-cac9d9fd-128d-43d2-b13d-5effadbbbe75.jar
1702 [main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-cac9d9fd-128d-43d2-b13d-5effadbbbe75.jar
1702 [main] INFO b.s.StormSubmitter - Submitting topology storm-twitter in distributed mode with conf {"topology.message.timeout.secs":120,"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-8800577250600957523:-7254875077049838623"}
1939 [main] INFO b.s.StormSubmitter - Finished submitting topology: storm-twitter

I am checking the Storm UI and things seem to be OK, however not much processing is going on (nothing seems to be emitted or transferred). I don't have any output stats or errors from the spout, and the same goes for all the other bolts. I posted 2 tweets from my account and created the Hive table indicated in your docs, but no results have appeared.

How often does your spout collect new statuses from the Twitter stream? My keys and access token are all set to read and write... What do you think might be possible causes for not getting anything here? Any tips on the best way to troubleshoot this kind of thing?

Thanks!

Expert Contributor

Thanks Pierre. For some reason, when I try to run in local mode I get this kind of exception:

10887 [Thread-20-TweetHdfsBolt] WARN o.a.h.u.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11100 [Thread-20-TweetHdfsBolt] ERROR b.s.util - Async loop died!
java.lang.RuntimeException: Error preparing HdfsBolt: java.net.UnknownHostException: mycluster
	at org.apache.storm.hdfs.bolt.AbstractHdfsBolt.prepare(AbstractHdfsBolt.java:109) ~[storm-twitter-0.0.1-SNAPSHOT.jar:?]
	at backtype.storm.daemon.executor$fn__7245$fn__7258.invoke(executor.clj:746) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
	at backtype.storm.util$async_loop$fn__544.invoke(util.clj:473) [storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
	at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:?]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]

I was wondering if I should perform this setup (cloning the package and configuring the pom.xml) in order to make the stream collection from Twitter work, or if it is something already included in your package: http://twitter4j.org/en/index.html

Thanks!

Expert Contributor

Thanks again. It worked. I have made the changes and also fixed a permission issue on the /storm/ folder in HDFS. Things seem to be working very well now and I don't see any errors in the Storm UI. However, when I go to the Hive view and try a simple SELECT * FROM tweet_counts LIMIT 10; I get the exceptions below. Have you ever run into this? I have been investigating, but I am not sure where this is coming from...

P.S. ambari is the name of the database where I created the tweet_counts table in my Hive instance.

{"trace":"org.apache.ambari.view.hive.client.HiveErrorStatusException: H170 Unable to fetch results. java.io.IOException: java.io.FileNotFoundException: Path is not a file: /apps/hive/warehouse/ambari.db/tweet_counts\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)\n\tat .......


You should run the command from the directory containing the jar, which is probably the target directory under storm-twitter.
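
For example (the path is illustrative, and <args> stands for whatever arguments the topology actually expects):

    cd storm-twitter/target
    storm jar storm-twitter-0.0.1-SNAPSHOT.jar fr.pvillard.storm.topology.Topology <args>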

There is no host=... argument. This argument is either "local" or "cluster", depending on how you want to run your topology with Storm (locally to ease debugging, for example, or in a distributed way).
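
To illustrate what such an argument typically drives, here is a minimal sketch of a topology main class, assuming storm-core 0.10 (backtype.storm packages); it is not the actual Topology class from the repository:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class TopologySketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // builder.setSpout(...) and builder.setBolt(...) calls go here
            Config conf = new Config();
            if ("local".equals(args[0])) {
                // In-process cluster: the whole topology runs in one JVM,
                // which makes logs and errors much easier to inspect.
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("storm-twitter", conf, builder.createTopology());
            } else {
                // Submits the jar to the Nimbus configured in storm.yaml.
                StormSubmitter.submitTopology("storm-twitter", conf, builder.createTopology());
            }
        }
    }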

A final word: you should not post your Twitter credentials. They are private data and could be used by someone else to act on your behalf. I strongly encourage you to regenerate new credentials from the Twitter apps page.


Try running the topology in local mode. It will be easier to see what is happening.
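
If you also enable debug in the topology configuration, Storm logs every tuple that is emitted, which makes a silent spout easy to spot. A two-line sketch, to be added wherever the Config object is built:

    Config conf = new Config();
    conf.setDebug(true); // log every tuple emitted by spouts and bolts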


No, this is because I didn't export every variable out of the code. If you look at the main class describing the topology (https://github.com/pvillard31/storm-twitter/blob/master/src/main/java/fr/pvillard/storm/topology/Topology.java), you will see that I reference the entry points for writing into HDFS (line 38) and into Hive (line 36). You should update this class with your own parameters and rebuild the topology using Maven (or, better, expose the variables as arguments that you pass when running the command).
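
For reference, the UnknownHostException: mycluster means the fsUrl given to the HdfsBolt names an HDFS HA nameservice that the machine running the topology cannot resolve; you either need the cluster's HDFS client configuration on the classpath or a directly reachable NameNode address. A sketch of the storm-hdfs wiring, with an illustrative host and path rather than the repository's actual values:

    import org.apache.storm.hdfs.bolt.HdfsBolt;
    import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
    import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
    import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
    import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

    public class HdfsBoltFactory {
        // Illustrative wiring: fsUrl must point at a NameNode the worker can resolve
        // (or keep the nameservice and ship hdfs-site.xml/core-site.xml with the topology).
        public static HdfsBolt tweetHdfsBolt() {
            return new HdfsBolt()
                    .withFsUrl("hdfs://namenode.example.com:8020")
                    .withFileNameFormat(new DefaultFileNameFormat().withPath("/tweets/"))
                    .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
                    .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB))
                    .withSyncPolicy(new CountSyncPolicy(1000));
        }
    }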