Posted 08-21-2013 08:47 AM
As many of you reading this may already know, Cloudera has previously provided some excellent examples of how to use Flume to ingest Twitter data into Hadoop and analyze it with Hue:

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/
https://github.com/cloudera/cdh-twitter-example

As an alternative to writing to HDFS, I've written a small prototype (available on GitHub) that uses Flume to write the tweets to HBase, so they can then be reported on directly, in real time, via Impala.

If you wish to set up this prototype:

1. Set up Hadoop and follow Cloudera's Twitter example: configure Flume and the Twitter4J API to write tweets to HDFS
---------------------------------------------------------------------------------------------------------------------

Cloudera's steps:
http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/
https://github.com/cloudera/cdh-twitter-example

Dan Sandler (www.datadansandler.com) has also created a document and videos walking through the entire process in detail:
http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html
http://www.youtube.com/watch?v=2xO_8P09M38&list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G

I found these a useful additional reference whenever I got stuck following Cloudera's steps.

2. Set up Flume to write to HBase and Impala
---------------------------------------------------

https://github.com/AronMacDonald/Twitter_Hbase_Impala

Note: My Flume HBase sink code was inspired by Dan Sandler's Apache web log Flume/HBase example:
https://github.com/DataDanSandler/log_analysis

In HBase you need to create a table to store the tweets:

sudo -u hdfs hbase shell
create 'tweets', {NAME => 'tweet'}, {NAME => 'retweeted_status'}, {NAME => 'entities'}, {NAME => 'user'}

Then create an external table mapped to the HBase table, so the tweets can be queried from Impala:

CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (
  id int,
  id_str string,
  text string,
  created_at timestamp,
  geo_latitude double,
  geo_longitude double,
  user_screen_name string,
  user_location string,
  user_followers_count string,
  user_profile_image_url string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude,user:screen_name,user:location,user:followers_count,user:profile_image_url"
)
TBLPROPERTIES ("hbase.table.name" = "tweets");

For those interested in integrating with SAP HANA, I've also added some logic to the Flume event handling to write a subset of fields to SAP HANA as well. Further details on that are in an SAP blog post:
http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoop-and-hana

I'm still working on other parts of the prototype, to make use of the tweet information both within Impala and HANA. Hopefully I'll be able to share that as well if/when I get it working. 🙂

In the meantime, I've recently seen two other examples of using Twitter data for sentiment analysis which may be of interest:

Hortonworks: http://www.youtube.com/watch?feature=player_embedded&v=y3nFfsTnY3M
SAP HANA SCN: http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one
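To give a flavour of the real-time reporting side, below is a minimal sketch of the kind of queries you could run in impala-shell against the HB_IMPALA_TWEETS table defined above. These queries are my own illustration and are not part of the prototype; depending on your Impala version you may need INVALIDATE METADATA (or REFRESH) before Impala sees the newly created table, and the ORDER BY on created_at assumes the tweet timestamp string actually parses into the timestamp column.

-- Make Impala aware of the newly created table (statement name varies by Impala version)
INVALIDATE METADATA;

-- The ten most recently captured tweets
SELECT created_at, user_screen_name, text
FROM hb_impala_tweets
ORDER BY created_at DESC
LIMIT 10;

-- A simple report: which users appear most often in the captured stream
SELECT user_screen_name, COUNT(*) AS tweet_count
FROM hb_impala_tweets
GROUP BY user_screen_name
ORDER BY COUNT(*) DESC
LIMIT 10;

Because the Impala table is just a mapping over the HBase table, tweets written by the Flume sink show up in these queries as soon as they land, which is what makes the real-time reporting possible.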