Created on 08-21-2013 08:47 AM - edited 09-16-2022 08:05 AM
As many of you reading this may already know, Cloudera has previously provided some excellent examples of how to use Flume to ingest Twitter data into Hadoop and analyze it with Hue.
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/
https://github.com/cloudera/cdh-twitter-example
As an alternative to writing to HDFS, I’ve written a small prototype (available on GitHub) that uses Flume to write the tweets to HBase and then report on them directly, in real time, via Impala.
If you wish to set up this prototype:
1. Set up Hadoop and follow Cloudera’s Twitter example, configuring Flume and the Twitter4J API to write tweets to HDFS:
---------------------------------------------------------------------------------------------------------------------------------------------
Cloudera’s Steps:
https://github.com/cloudera/cdh-twitter-example
Dan Sandler (www.datadansandler.com) has also created a document and videos walking through the entire process in detail:
http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html
http://www.youtube.com/watch?v=2xO_8P09M38&list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G
I found these useful additional references when I got stuck following Cloudera’s steps.
2. Set up Flume to write to HBase and report via Impala
---------------------------------------------------
https://github.com/AronMacDonald/Twitter_Hbase_Impala
Note: My Flume HBase sink code was inspired by Dan Sandler’s Apache web log Flume/HBase example:
https://github.com/DataDanSandler/log_analysis
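For orientation, a Flume agent writing to HBase is configured along the lines below. This is only a hedged sketch of the usual pattern, not the exact file from the Twitter_Hbase_Impala repository — the agent name, sink name, and serializer class here are assumptions, and a custom sink would substitute its own class:

```properties
# Hypothetical flume.conf fragment: Twitter source -> memory channel -> HBase sink.
# TwitterAgent, HbaseSink, and the serializer class are illustrative names, not
# the exact values used in the repository above.
TwitterAgent.sinks.HbaseSink.type = org.apache.flume.sink.hbase.HBaseSink
TwitterAgent.sinks.HbaseSink.table = tweets
TwitterAgent.sinks.HbaseSink.columnFamily = tweet
TwitterAgent.sinks.HbaseSink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
TwitterAgent.sinks.HbaseSink.channel = MemChannel
```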
In HBase you need to create a table to store the tweets:
sudo -u hdfs hbase shell
create 'tweets', {NAME => 'tweet'}, {NAME => 'retweeted_status'}, {NAME => 'entities'}, {NAME => 'user'}
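Once Flume is running, you can sanity-check that rows are arriving with standard HBase shell commands (the table name comes from the create statement above):

```shell
sudo -u hdfs hbase shell
# Inside the shell:
#   list                          -- confirm the 'tweets' table exists
#   scan 'tweets', {LIMIT => 1}   -- show one stored tweet
#   count 'tweets'                -- rough count of ingested rows
```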
In Impala you create an external table linked to the HBase table:
CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (
id int,
id_str string,
text string,
created_at timestamp,
geo_latitude double,
geo_longitude double,
user_screen_name string,
user_location string,
user_followers_count string,
user_profile_image_url string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude,user:screen_name,user:location,user:followers_count,user:profile_image_url"
)
TBLPROPERTIES("hbase.table.name" = "tweets");
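With the column mapping in place, the tweets can be queried from Impala like any other table. A simple illustrative query (not from the repository — the WHERE and ORDER BY columns are just examples of what the mapped fields allow):

```sql
-- Ten most recent geo-tagged tweets
SELECT created_at, user_screen_name, text
FROM hb_impala_tweets
WHERE geo_latitude IS NOT NULL
ORDER BY created_at DESC
LIMIT 10;
```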
For those interested in integrating with SAP HANA, I’ve also added some logic in the Flume event handling to write a subset of fields to SAP HANA as well.
Further details on that are available in an SAP blog.
I’m still working on other parts of the prototype, to make use of the tweet information within both Impala and HANA.
Hopefully I’ll be able to share that as well if/when I get it working. 🙂
In the meantime I’ve recently seen two other examples of using Twitter data for sentiment analysis which may be of interest:
Hortonworks http://www.youtube.com/watch?feature=player_embedded&v=y3nFfsTnY3M
SAP HANA SCN http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-mov...
Created 10-16-2013 12:23 PM
Thanks for the post, Aron.
Created 06-23-2014 07:44 AM
Hello,
When I try to link the external table in Impala to the HBase table, I get:
CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (
> id int,
> id_str string,
> text string,
> created_at timestamp,
> geo_latitude double,
> geo_longitude double,
> user_screen_name string,
> user_location string,
> user_followers_count string,
> user_profile_image_url string
>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
> "hbase.columns.mapping" =
> ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url"
> )
> TBLPROPERTIES("hbase.table.name" = "tweets");
Query: create EXTERNAL TABLE HB_IMPALA_TWEETS ( id int, id_str string, text string, created_at timestamp, geo_latitude double, geo_longitude double, user_screen_name string, user_location string, user_followers_count string, user_profile_image_url string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url" ) TBLPROPERTIES("hbase.table.name" = "tweets")
ERROR: AnalysisException: Syntax error in line 1:
...image_url string ) STORED BY 'org.apache.hadoop.hive.h...
^
Encountered: BY
Expected: AS
CAUSED BY: Exception: Syntax error
Any idea why it is not working? Do I need to add a JAR?
Created 07-01-2014 02:18 AM
We just need to use Hive to create the table instead — Impala’s SQL parser does not accept the STORED BY clause, but once the table exists in the metastore, Impala can query it.
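For anyone hitting the same error: Impala’s DDL does not support STORED BY, so the CREATE EXTERNAL TABLE statement from the original post has to be run through the Hive shell. A minimal sketch of the workaround (table name taken from the post above):

```sql
-- 1. In the Hive shell, run the original CREATE EXTERNAL TABLE ... STORED BY ...
--    statement unchanged.
-- 2. Then, in impala-shell, make Impala pick up the new table from the metastore:
INVALIDATE METADATA;
-- 3. Verify Impala can read it:
SELECT COUNT(*) FROM hb_impala_tweets;
```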