<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Question: Real-time Analysis of Twitter using Impala in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/797#M99</link>
    <description>&lt;P&gt;As many of you reading this may already know, Cloudera has previously provided some excellent examples of how to use Flume to ingest Twitter data into HADOOP,&amp;nbsp; and analyze with Hue.&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/"&gt;http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/"&gt;http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="https://github.com/cloudera/cdh-twitter-example"&gt;https://github.com/cloudera/cdh-twitter-example&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As an alternative to writing to HDFS, I’ve written a small prototype (available on &lt;A target="_blank" href="https://github.com/AronMacDonald/Twitter_Hbase_Impala"&gt;GitHub&lt;/A&gt;), using Flume, to write the tweets to Hbase and then report directly in real-time via Impala.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you wish to setup this prototype then:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;1. 
Setup Hadoop and follow Cloudera’s Twitter example: setting up Flume and Twitter4J API to write tweets to HDFS:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;---------------------------------------------------------------------------------------------------------------------------------------------&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Cloudera’s Steps:&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/"&gt;http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="https://github.com/cloudera/cdh-twitter-example"&gt;https://github.com/cloudera/cdh-twitter-example&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Dan Sander (&lt;A target="_blank" href="http://www.datadansandler.com/"&gt;www.datadansandler.com&lt;/A&gt;) has also created a document and videos walking through the entire process in detail&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html"&gt;http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://www.youtube.com/watch?v=2xO_8P09M38&amp;amp;list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G"&gt;http://www.youtube.com/watch?v=2xO_8P09M38&amp;amp;list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I found this a useful additional reference if I got stuck following Clouderas Steps.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;2. 
Setup&amp;nbsp; Flume to write to HBASE, Impala&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;---------------------------------------------------&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="https://github.com/AronMacDonald/Twitter_Hbase_Impala"&gt;https://github.com/AronMacDonald/Twitter_Hbase_Impala&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;Note: My Flume Hbase Sink code was inspired by Dan Sandler’s Apache Web Log Flume Hbase example&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;A target="_blank" 
href="https://github.com/DataDanSandler/log_analysis"&gt;https://github.com/DataDanSandler/log_analysis&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;In Hbase you need to create a table to store the tweets:&lt;/P&gt;&lt;P&gt;&amp;nbsp; sudo -u hdfs hbase shell&lt;/P&gt;&lt;P&gt;&amp;nbsp; create 'tweets', {NAME =&amp;gt; 'tweet'}, {NAME =&amp;gt; 'retweeted_status'}, {NAME =&amp;gt; 'entities'}, {NAME =&amp;gt; 'user'}&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In Impala you create a table linked to the HBASE table:&lt;/P&gt;&lt;P&gt;CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (&lt;/P&gt;&lt;P&gt;&amp;nbsp; id &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; int,&lt;/P&gt;&lt;P&gt;&amp;nbsp; id_str &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; text &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; created_at &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; timestamp,&lt;/P&gt;&lt;P&gt;&amp;nbsp; geo_latitude &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; double,&lt;/P&gt;&lt;P&gt;&amp;nbsp; geo_longitude &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_screen_name &amp;nbsp; &amp;nbsp; &amp;nbsp; string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_location &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_followers_count &amp;nbsp; string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_profile_image_url string&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'&lt;/P&gt;&lt;P&gt;WITH SERDEPROPERTIES (&lt;/P&gt;&lt;P&gt;"hbase.columns.mapping" 
=&lt;/P&gt;&lt;P&gt;":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url"&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;TBLPROPERTIES("&lt;A target="_blank" href="http://hbase.table.name/"&gt;hbase.table.name&lt;/A&gt;" = "tweets");&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For those that are interested in integrating with SAP HANA I’ve also added some logic in the Flume event to write a subset of fields to SAP HANA as well.&lt;/P&gt;&lt;P&gt;Further details on that are on an SAP blog&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoop-and-hana"&gt;http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoop-and-hana&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’m still working on other parts of the prototype, to make use of the Tweet information both within Impala and HANA&lt;/P&gt;&lt;P&gt;Hopefully I’ll be able to share that as well if/when I get it working. 
&amp;nbsp; &amp;nbsp; &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the mean time I’ve recently seen 2 other examples of using Twitter data for &lt;STRONG&gt;Sentiment Analysis &lt;/STRONG&gt;which may&amp;nbsp;interest:&lt;/P&gt;&lt;P&gt;Hortonworks&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A target="_blank" href="http://www.youtube.com/watch?feature=player_embedded&amp;amp;v=y3nFfsTnY3M"&gt;http://www.youtube.com/watch?feature=player_embedded&amp;amp;v=y3nFfsTnY3M&lt;/A&gt;&lt;/P&gt;&lt;P&gt;SAP HANA SCN&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A target="_blank" href="http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one"&gt;http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 21 Apr 2026 14:02:50 GMT</pubDate>
    <dc:creator>Aron</dc:creator>
    <dc:date>2026-04-21T14:02:50Z</dc:date>
    <item>
      <title>Real-time Analysis of Twitter using Impala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/797#M99</link>
      <description>&lt;P&gt;As many of you reading this may already know, Cloudera has previously provided some excellent examples of how to use Flume to ingest Twitter data into HADOOP,&amp;nbsp; and analyze with Hue.&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/"&gt;http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/"&gt;http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="https://github.com/cloudera/cdh-twitter-example"&gt;https://github.com/cloudera/cdh-twitter-example&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As an alternative to writing to HDFS, I’ve written a small prototype (available on &lt;A target="_blank" href="https://github.com/AronMacDonald/Twitter_Hbase_Impala"&gt;GitHub&lt;/A&gt;), using Flume, to write the tweets to Hbase and then report directly in real-time via Impala.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you wish to setup this prototype then:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;1. 
Setup Hadoop and follow Cloudera’s Twitter example: setting up Flume and Twitter4J API to write tweets to HDFS:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;---------------------------------------------------------------------------------------------------------------------------------------------&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Cloudera’s Steps:&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/"&gt;http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="https://github.com/cloudera/cdh-twitter-example"&gt;https://github.com/cloudera/cdh-twitter-example&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Dan Sander (&lt;A target="_blank" href="http://www.datadansandler.com/"&gt;www.datadansandler.com&lt;/A&gt;) has also created a document and videos walking through the entire process in detail&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html"&gt;http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://www.youtube.com/watch?v=2xO_8P09M38&amp;amp;list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G"&gt;http://www.youtube.com/watch?v=2xO_8P09M38&amp;amp;list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I found this a useful additional reference if I got stuck following Clouderas Steps.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;2. 
Setup&amp;nbsp; Flume to write to HBASE, Impala&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;---------------------------------------------------&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="https://github.com/AronMacDonald/Twitter_Hbase_Impala"&gt;https://github.com/AronMacDonald/Twitter_Hbase_Impala&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="line-height: 14px;"&gt;Note: My Flume Hbase Sink code was inspired by Dan Sandler’s Apache Web Log Flume Hbase example&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;A target="_blank" 
href="https://github.com/DataDanSandler/log_analysis"&gt;https://github.com/DataDanSandler/log_analysis&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;In Hbase you need to create a table to store the tweets:&lt;/P&gt;&lt;P&gt;&amp;nbsp; sudo -u hdfs hbase shell&lt;/P&gt;&lt;P&gt;&amp;nbsp; create 'tweets', {NAME =&amp;gt; 'tweet'}, {NAME =&amp;gt; 'retweeted_status'}, {NAME =&amp;gt; 'entities'}, {NAME =&amp;gt; 'user'}&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In Impala you create a table linked to the HBASE table:&lt;/P&gt;&lt;P&gt;CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (&lt;/P&gt;&lt;P&gt;&amp;nbsp; id &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; int,&lt;/P&gt;&lt;P&gt;&amp;nbsp; id_str &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; text &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; created_at &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; timestamp,&lt;/P&gt;&lt;P&gt;&amp;nbsp; geo_latitude &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; double,&lt;/P&gt;&lt;P&gt;&amp;nbsp; geo_longitude &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;double,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_screen_name &amp;nbsp; &amp;nbsp; &amp;nbsp; string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_location &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_followers_count &amp;nbsp; string,&lt;/P&gt;&lt;P&gt;&amp;nbsp; user_profile_image_url string&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'&lt;/P&gt;&lt;P&gt;WITH SERDEPROPERTIES (&lt;/P&gt;&lt;P&gt;"hbase.columns.mapping" 
=&lt;/P&gt;&lt;P&gt;":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url"&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;TBLPROPERTIES("&lt;A target="_blank" href="http://hbase.table.name/"&gt;hbase.table.name&lt;/A&gt;" = "tweets");&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For those that are interested in integrating with SAP HANA I’ve also added some logic in the Flume event to write a subset of fields to SAP HANA as well.&lt;/P&gt;&lt;P&gt;Further details on that are on an SAP blog&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoop-and-hana"&gt;http://scn.sap.com/community/developer-center/hana/blog/2013/08/07/streaming-real-time-data-to-hadoop-and-hana&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’m still working on other parts of the prototype, to make use of the Tweet information both within Impala and HANA&lt;/P&gt;&lt;P&gt;Hopefully I’ll be able to share that as well if/when I get it working. 
&amp;nbsp; &amp;nbsp; &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the mean time I’ve recently seen 2 other examples of using Twitter data for &lt;STRONG&gt;Sentiment Analysis &lt;/STRONG&gt;which may&amp;nbsp;interest:&lt;/P&gt;&lt;P&gt;Hortonworks&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A target="_blank" href="http://www.youtube.com/watch?feature=player_embedded&amp;amp;v=y3nFfsTnY3M"&gt;http://www.youtube.com/watch?feature=player_embedded&amp;amp;v=y3nFfsTnY3M&lt;/A&gt;&lt;/P&gt;&lt;P&gt;SAP HANA SCN&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A target="_blank" href="http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one"&gt;http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 14:02:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/797#M99</guid>
      <dc:creator>Aron</dc:creator>
      <dc:date>2026-04-21T14:02:50Z</dc:date>
    </item>
    <item>
      <title>Re: Real-time Analysis of Twitter using Impala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/2315#M100</link>
      <description>&lt;P&gt;Thanks for the post, Aron.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Oct 2013 19:23:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/2315#M100</guid>
      <dc:creator>Clint</dc:creator>
      <dc:date>2013-10-16T19:23:17Z</dc:date>
    </item>
    <item>
      <title>Re: Real-time Analysis of Twitter using Impala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/14018#M101</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When I try to link the external table to impala from hbase i get:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (&lt;BR /&gt;&amp;gt; id int,&lt;BR /&gt;&amp;gt; id_str string,&lt;BR /&gt;&amp;gt; text string,&lt;BR /&gt;&amp;gt; created_at timestamp,&lt;BR /&gt;&amp;gt; geo_latitude double,&lt;BR /&gt;&amp;gt; geo_longitude double,&lt;BR /&gt;&amp;gt; user_screen_name string,&lt;BR /&gt;&amp;gt; user_location string,&lt;BR /&gt;&amp;gt; user_followers_count string,&lt;BR /&gt;&amp;gt; user_profile_image_url string&lt;BR /&gt;&amp;gt;&lt;BR /&gt;&amp;gt; )&lt;BR /&gt;&amp;gt; STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'&lt;BR /&gt;&amp;gt; WITH SERDEPROPERTIES (&lt;BR /&gt;&amp;gt; "hbase.columns.mapping" =&lt;BR /&gt;&amp;gt; ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url"&lt;BR /&gt;&amp;gt; )&lt;BR /&gt;&amp;gt; TBLPROPERTIES("hbase.table.name" = "tweets");&lt;BR /&gt;Query: create EXTERNAL TABLE HB_IMPALA_TWEETS ( id int, id_str string, text string, created_at timestamp, geo_latitude double, geo_longitude double, user_screen_name string, user_location string, user_followers_count string, user_profile_image_url string ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude, user:screen_name,user:location,user:followers_count,user:profile_image_url" ) TBLPROPERTIES("hbase.table.name" = "tweets")&lt;BR /&gt;ERROR: AnalysisException: Syntax error in line 1:&lt;BR /&gt;...image_url string ) STORED &lt;U&gt;&lt;STRONG&gt;BY&lt;/STRONG&gt;&lt;/U&gt; 'org.apache.hadoop.hive.h...&lt;BR /&gt;^&lt;BR /&gt;Encountered: BY&lt;BR /&gt;Expected: AS&lt;/P&gt;&lt;P&gt;CAUSED BY: Exception: 
Syntax error&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any Idea why it is not working ? Do I need to add a JAR ?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 23 Jun 2014 14:44:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/14018#M101</guid>
      <dc:creator>Kulssaka</dc:creator>
      <dc:date>2014-06-23T14:44:12Z</dc:date>
    </item>
    <item>
      <title>Re: Real-time Analysis of Twitter using Impala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/14562#M102</link>
      <description>&lt;P&gt;We just need to use Hive to create the impala table...&lt;/P&gt;</description>
      <pubDate>Tue, 01 Jul 2014 09:18:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Real-time-Analysis-of-Twitter-using-Impala/m-p/14562#M102</guid>
      <dc:creator>Kulssaka</dc:creator>
      <dc:date>2014-07-01T09:18:06Z</dc:date>
    </item>
  </channel>
</rss>

