Real-time Analysis of Twitter using Impala


As many of you reading this may already know, Cloudera has previously provided some excellent examples of how to use Flume to ingest Twitter data into Hadoop and analyze it with Hue:

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/

https://github.com/cloudera/cdh-twitter-example

 

As an alternative to writing to HDFS, I’ve written a small prototype (available on GitHub) that uses Flume to write the tweets to HBase and then reports on them directly, in real time, via Impala.

  

If you wish to set up this prototype:

 

1. Set up Hadoop and follow Cloudera’s Twitter example, which configures Flume and the Twitter4J API to write tweets to HDFS:

---------------------------------------------------------------------------------------------------------------------------------------------

Cloudera’s Steps:

http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager...

https://github.com/cloudera/cdh-twitter-example

 

 

Dan Sandler (www.datadansandler.com) has also created a document and videos walking through the entire process in detail:

http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html

http://www.youtube.com/watch?v=2xO_8P09M38&list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G

I found this a useful additional reference whenever I got stuck following Cloudera’s steps.

 

 

2. Set up Flume to write to HBase, then query with Impala

---------------------------------------------------

https://github.com/AronMacDonald/Twitter_Hbase_Impala

Note: My Flume HBase sink code was inspired by Dan Sandler's Apache web log Flume HBase example:

                             https://github.com/DataDanSandler/log_analysis

 

 

In HBase you need to create a table to store the tweets, with one column family per group of fields:

  sudo -u hdfs hbase shell

  create 'tweets', {NAME => 'tweet'}, {NAME => 'retweeted_status'}, {NAME => 'entities'}, {NAME => 'user'}
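
To make the column-family layout concrete, here is a minimal Python sketch of how one tweet's JSON could be flattened into "family:qualifier" cells for this table. It is not the sink code from the GitHub repo (that is a Flume sink). The function name tweet_to_cells, the choice of id_str as the row key, and the assumption that the legacy "geo" field holds [latitude, longitude] are all illustrative assumptions.

```python
import json


def tweet_to_cells(raw_json):
    """Return (row_key, cells) for one tweet.

    cells maps "family:qualifier" to a string value, using the 'tweet'
    and 'user' column families created above.
    Assumption: the row key is the tweet's id_str; the real sink may differ.
    """
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}

    cells = {
        "tweet:id_str": tweet.get("id_str"),
        "tweet:text": tweet.get("text"),
        "tweet:created_at": tweet.get("created_at"),
        "user:screen_name": user.get("screen_name"),
        "user:location": user.get("location"),
        "user:followers_count": user.get("followers_count"),
        "user:profile_image_url": user.get("profile_image_url"),
    }

    # Assumption: "geo" is either null or
    # {"type": "Point", "coordinates": [latitude, longitude]}.
    geo = tweet.get("geo") or {}
    coords = geo.get("coordinates") or []
    if len(coords) == 2:
        cells["tweet:geo_latitude"] = coords[0]
        cells["tweet:geo_longitude"] = coords[1]

    # HBase stores raw bytes, so skip missing fields and stringify the rest.
    return tweet["id_str"], {k: str(v) for k, v in cells.items() if v is not None}


if __name__ == "__main__":
    sample = json.dumps({
        "id_str": "42",
        "text": "hello",
        "geo": None,
        "user": {"screen_name": "aron", "followers_count": 7},
    })
    row_key, cells = tweet_to_cells(sample)
    for name in sorted(cells):
        print(row_key, name, cells[name])
```

Fields that are absent from a tweet simply produce no cell, which suits HBase's sparse storage model.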

 

Next, create a table linked to the HBase table so that Impala can query it. Run this statement in Hive rather than in impala-shell: STORED BY is Hive syntax, and Impala reads the resulting table through the shared metastore. The entries in hbase.columns.mapping correspond to the columns by position and must not contain spaces.

CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (
  id                     int,
  id_str                 string,
  text                   string,
  created_at             timestamp,
  geo_latitude           double,
  geo_longitude          double,
  user_screen_name       string,
  user_location          string,
  user_followers_count   string,
  user_profile_image_url string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude,user:screen_name,user:location,user:followers_count,user:profile_image_url"
)
TBLPROPERTIES("hbase.table.name" = "tweets");
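
Because hbase.columns.mapping is matched to the table columns purely by position, a miscounted or mistyped entry is easy to miss. Hive's HBase integration documentation also warns that whitespace between entries becomes part of the column name. The following Python sketch checks a mapping string before it is pasted into the DDL. The helper check_mapping and the constants are hypothetical names for illustration, not part of Hive, HBase or Impala.

```python
# Column names in the order they appear in the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    "id", "id_str", "text", "created_at", "geo_latitude", "geo_longitude",
    "user_screen_name", "user_location", "user_followers_count",
    "user_profile_image_url",
]

MAPPING = (
    ":key,tweet:id_str,tweet:text,tweet:created_at,"
    "tweet:geo_latitude,tweet:geo_longitude,"
    "user:screen_name,user:location,user:followers_count,"
    "user:profile_image_url"
)


def check_mapping(columns, mapping):
    """Return a list of human-readable problems; an empty list means OK."""
    entries = mapping.split(",")
    problems = []

    # Entries are paired with columns by position, so the counts must match.
    if len(entries) != len(columns):
        problems.append(
            "%d columns but %d mapping entries" % (len(columns), len(entries))
        )

    for column, entry in zip(columns, entries):
        if entry != entry.strip():
            # Leading or trailing spaces would become part of the name.
            problems.append(
                "whitespace in mapping entry for %s: %r" % (column, entry)
            )
        elif entry != ":key" and entry.count(":") != 1:
            problems.append(
                "entry for %s is not family:qualifier: %r" % (column, entry)
            )

    return problems


if __name__ == "__main__":
    for problem in check_mapping(COLUMNS, MAPPING):
        print(problem)
```

An empty result only means the mapping is well formed; it does not confirm that the column families exist in the HBase table.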

 

 

 

For those interested in integrating with SAP HANA, I’ve also added some logic to the Flume event handling that writes a subset of the fields to SAP HANA as well.

Further details are in this SAP blog post:

http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoo...

 

 

I’m still working on other parts of the prototype, to make use of the tweet information within both Impala and HANA.

Hopefully I’ll be able to share that as well if/when I get it working. :-)

 

In the meantime, I’ve recently seen two other examples of using Twitter data for sentiment analysis which may be of interest:

Hortonworks: http://www.youtube.com/watch?feature=player_embedded&v=y3nFfsTnY3M

SAP HANA SCN: http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-mov...

 

 

 

Re: Real-time Analysis of Twitter using Impala

Master Collaborator

Thanks for the post, Aron.


Re: Real-time Analysis of Twitter using Impala

Explorer

Hello,

 

When I try to link the external table to Impala from HBase, I get:

 

CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (
> id int,
> id_str string,
> text string,
> created_at timestamp,
> geo_latitude double,
> geo_longitude double,
> user_screen_name string,
> user_location string,
> user_followers_count string,
> user_profile_image_url string
>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
> "hbase.columns.mapping" =
> ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url"
> )
> TBLPROPERTIES("hbase.table.name" = "tweets");
Query: create EXTERNAL TABLE HB_IMPALA_TWEETS ( id int, id_str string, text string, created_at timestamp, geo_latitude double, geo_longitude double, user_screen_name string, user_location string, user_followers_count string, user_profile_image_url string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url" ) TBLPROPERTIES("hbase.table.name" = "tweets")
ERROR: AnalysisException: Syntax error in line 1:
...image_url string ) STORED BY 'org.apache.hadoop.hive.h...
^
Encountered: BY
Expected: AS

CAUSED BY: Exception: Syntax error

 

Any idea why it is not working? Do I need to add a JAR?

--
Lefevre Kevin

Re: Real-time Analysis of Twitter using Impala

Explorer

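We just need to use Hive to create the table. Impala's CREATE TABLE does not accept the STORED BY clause, which is why the parser expected AS after STORED. Once the table exists in the Hive metastore, Impala can query it (after an INVALIDATE METADATA in impala-shell so that it picks up the new table).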

--
Lefevre Kevin