Created 04-21-2016 11:20 AM
Hello,
I am pretty new to Storm and I am getting started by trying to process some tweet streams with it. What would be the basic steps to start it?
I am aware that there is a streaming API for it (https://dev.twitter.com/streaming/overview), but how would I integrate it with my Storm elements to start making it work?
Any insights appreciated.
Thanks!
Created 04-21-2016 11:27 AM
Hi Wellington,
There are a few examples out there that use the Twitter API with Storm. You should have a look at the Hortonworks tutorials (like this one: http://hortonworks.com/hadoop-tutorial/ingesting-processing-real-time-events-apache-storm/) for comprehensive, well-explained examples using the sandbox.
If you only want simple code examples, you will find a lot by searching for "github twitter storm" on the internet. A possible example is: https://github.com/pvillard31/storm-twitter
But there are a lot of possibilities depending on how you want to use the Twitter API. For example, you will find some examples that leverage the Trident API in Storm.
Hope that helps!
Created 04-27-2016 12:25 AM
Hey Pierre. I performed a successful build; however, I am getting an error saying that the Topology class was not found when I run this:
[root@ip-172-31-34-25 storm-twitter]# storm jar storm-twitter-0.0.1-SNAPSHOT.jar fr.pvillard.storm.topology.Topology host=ec2-52-67-8-253.sa-east-1.compute.amazonaws.com kFI3G29IJ5UOMnbe3qmJpDw5L iZszClk61Lfdu6hTxRAIW1STPX1TtbFXpIKlehxHNUGIpMWYFT 140206682-thGEZ8KIYfYbHY9Rzvzu2CO8ry6UBmSEvUe0zOGZ 8uA5F0T4yhnLv16fgrFP4S6W5ETflmGzLd3dPW1chb46v
Error: Not able to locate nor load fr.pvillard.storm.topology.Topology
Should I run the command above from a specific folder?
Thanks-
Wellington
Created 04-30-2016 03:26 AM
Thanks for the tips, Pierre. I was able to run the topology with no errors (apparently):
460 [main] INFO b.s.u.Utils - Using defaults.yaml from resources
523 [main] INFO b.s.u.Utils - Using storm.yaml from resources
584 [main] INFO b.s.u.Utils - Using defaults.yaml from resources
602 [main] INFO b.s.u.Utils - Using storm.yaml from resources
608 [main] INFO b.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -8800577250600957523:-7254875077049838623
609 [main] INFO b.s.s.a.AuthUtils - Got AutoCreds []
624 [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
653 [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
654 [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
659 [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
663 [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
668 [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [2000] the maxSleepTimeMs [60000] the maxRetries [5]
676 [main] INFO b.s.StormSubmitter - Uploading topology jar storm-twitter-0.0.1-SNAPSHOT.jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-cac9d9fd-128d-43d2-b13d-5effadbbbe75.jar
1702 [main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-cac9d9fd-128d-43d2-b13d-5effadbbbe75.jar
1702 [main] INFO b.s.StormSubmitter - Submitting topology storm-twitter in distributed mode with conf {"topology.message.timeout.secs":120,"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-8800577250600957523:-7254875077049838623"}
1939 [main] INFO b.s.StormSubmitter - Finished submitting topology: storm-twitter
I am checking the Storm UI and things seem to be OK; however, not much processing is going on (nothing seems to be emitted or transferred). I don't see any output stats from the spout, nor any errors; same thing for all the bolts. I posted 2 tweets from my account and created the Hive table indicated in your docs, but no results have ever appeared.
How often do your spouts collect new tweets from the Twitter account? My keys and access token are all set to read and write... What do you think the possible causes might be for not getting anything here? Any tips on the best way to troubleshoot this kind of thing?
Thanks!
Created 04-30-2016 12:58 PM
Thanks, Pierre. For some reason, when I try to run in local mode I get this kind of exception:
10887 [Thread-20-TweetHdfsBolt] WARN o.a.h.u.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11100 [Thread-20-TweetHdfsBolt] ERROR b.s.util - Async loop died!
java.lang.RuntimeException: Error preparing HdfsBolt: java.net.UnknownHostException: mycluster
at org.apache.storm.hdfs.bolt.AbstractHdfsBolt.prepare(AbstractHdfsBolt.java:109) ~[storm-twitter-0.0.1-SNAPSHOT.jar:?]
at backtype.storm.daemon.executor$fn__7245$fn__7258.invoke(executor.clj:746) ~[storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at backtype.storm.util$async_loop$fn__544.invoke(util.clj:473) [storm-core-0.10.0.2.4.0.0-169.jar:0.10.0.2.4.0.0-169]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
I was wondering if I should perform this setup (cloning the package and configuring the pom.xml) in order to make the stream collection from Twitter work, or if it is something that is already included in your package: http://twitter4j.org/en/index.html
Thanks!
Created 05-01-2016 12:32 AM
Thanks again. It worked. I have made the changes and also fixed a permission issue around the /storm/ folder on HDFS. Things seem to be working very well now and I don't see any errors in the Storm UI. However, when I go to the Hive view and try to do a simple SELECT * FROM tweet_counts LIMIT 10; here are the exceptions I get (see below). Have you ever run into this? I have been investigating, but I am not sure where this is coming from...
P.S. ambari is the name of the database where I created the tweet_counts table on my Hive instance.
Created 04-28-2016 08:56 AM
You should run the command from the directory where the jar is; it is probably in the target directory under storm-twitter.
There is no argument host=.... This argument is either "local" or "cluster" depending on how you want to run your topology with Storm (locally, to ease debugging for example, or in a distributed way).
A final word: you should not post your Twitter credentials. They are private data and could be used by someone else to act on your behalf. I strongly encourage you to regenerate new credentials from the Twitter apps page.
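To illustrate the point about the first argument, here is a minimal sketch of how a topology's main class might validate a run-mode argument that must be either "local" or "cluster". The class name RunModeExample, the RunMode enum, and the parseRunMode method are hypothetical names made up for this illustration; this is not the code from the storm-twitter repository, which may parse its arguments differently.

```java
import java.util.Locale;

public class RunModeExample {
    /** The two run modes described above (hypothetical enum for illustration). */
    public enum RunMode { LOCAL, CLUSTER }

    /**
     * Parses the run-mode argument, ignoring case.
     * Anything other than "local" or "cluster" (for example "host=...") is rejected.
     */
    public static RunMode parseRunMode(String arg) {
        switch (arg.toLowerCase(Locale.ROOT)) {
            case "local":
                return RunMode.LOCAL;
            case "cluster":
                return RunMode.CLUSTER;
            default:
                throw new IllegalArgumentException(
                    "First argument must be \"local\" or \"cluster\", got: " + arg);
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: RunModeExample <local|cluster>");
            return;
        }
        // In a real topology, LOCAL would typically lead to an in-process test
        // cluster and CLUSTER to a submission to Nimbus; here we only print the mode.
        RunMode mode = parseRunMode(args[0]);
        System.out.println("Run mode: " + mode);
    }
}
```

With a check like this, passing host=... as the first argument fails immediately with a clear message instead of being silently misinterpreted.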
Created 04-30-2016 09:00 AM
Try to run the topology in local mode. It will be easier to see what is happening.
Created 04-30-2016 01:34 PM
No, this is because I didn't externalize every variable from the code. If you look at the main class describing the topology (https://github.com/pvillard31/storm-twitter/blob/master/src/main/java/fr/pvillard/storm/topology/Topology.java), you will see that I reference the entry points to write into HDFS (line 38) and into Hive (line 36) as well. You should update this class with your own parameters and rebuild the topology using Maven (or better, expose these variables as arguments that you pass when running the command).
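As a rough sketch of the "arguments" option, the HDFS and Hive entry points could be read from the command line with a fallback to defaults, so that no rebuild is needed when the cluster changes. Everything here is an assumption made for illustration: the class name TopologyArgs, the argument positions, and the placeholder default URLs are hypothetical and are not the values or the structure used in the repository.

```java
public class TopologyArgs {
    // Placeholder defaults for illustration only; replace with your own cluster's values.
    public static final String DEFAULT_HDFS_URL = "hdfs://namenode.example.com:8020";
    public static final String DEFAULT_HIVE_METASTORE_URI = "thrift://metastore.example.com:9083";

    public final String hdfsUrl;
    public final String hiveMetastoreUri;

    public TopologyArgs(String hdfsUrl, String hiveMetastoreUri) {
        this.hdfsUrl = hdfsUrl;
        this.hiveMetastoreUri = hiveMetastoreUri;
    }

    /**
     * Reads the HDFS URL from args[offset] and the Hive metastore URI from
     * args[offset + 1]. Either value falls back to its default when absent.
     * The offset lets these follow other arguments (e.g. run mode, credentials).
     */
    public static TopologyArgs parse(String[] args, int offset) {
        String hdfs = args.length > offset ? args[offset] : DEFAULT_HDFS_URL;
        String hive = args.length > offset + 1 ? args[offset + 1] : DEFAULT_HIVE_METASTORE_URI;
        return new TopologyArgs(hdfs, hive);
    }

    public static void main(String[] args) {
        TopologyArgs parsed = parse(args, 0);
        System.out.println("HDFS URL: " + parsed.hdfsUrl);
        System.out.println("Hive metastore URI: " + parsed.hiveMetastoreUri);
    }
}
```

The parsed values would then be handed to whatever configures the HDFS and Hive bolts, instead of the constants currently in the class. An HDFS address that the machine running the topology cannot resolve is the kind of thing that produces an UnknownHostException like the one in the trace above.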