As part of this community, I would like to ask you all for some help.
The issue is the following:
I've extracted data from Twitter and stored it in an external Hive table.
What I want to produce is a study of the common words used by users who wrote about a specific subject.
A big part of the results should help us understand how people perceive the subject we were looking at.
Any ideas on how to complete this task?
The first step, getting the data into Hadoop, is done.
Now, this problem can be tackled in multiple ways in Hadoop.
In Hive, take a look at this:
You can also choose to use Spark here.
Let us know how you progress. All the best.
Hive provides a few statistics and data mining functions, such as ngrams() and context_ngrams().
ngrams() gives you the k most frequent n-grams (single words when n = 1) across one or more sequences of tokens.
context_ngrams() extends ngrams() by letting you add a context to your mining, i.e., in your case, a 'subject': it returns the most frequent words that appear in the open positions around the context words you specify.
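To make the difference between the two functions concrete, here is a small plain-Scala sketch that mimics the idea on a handful of in-memory strings. It is only an illustration, not how Hive implements these functions (Hive works on tokenized table columns and returns estimated frequencies). The names NgramSketch, tokenize, topNgrams and topAfter, and the sample tweets, are all made up for this example.

```scala
// Illustration only: mimics the idea behind Hive's ngrams() and
// context_ngrams() on a tiny in-memory sample. All names are hypothetical.
object NgramSketch {
  // Split a text into lowercase word tokens (keeps #hashtags and @mentions).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9#@']+").filter(_.nonEmpty).toSeq

  // Like ngrams(): the k most frequent n-grams across all texts.
  // Ties are broken alphabetically so the result is deterministic.
  def topNgrams(texts: Seq[String], n: Int, k: Int): Seq[(Seq[String], Int)] =
    texts
      .flatMap(t => tokenize(t).sliding(n).filter(_.size == n))
      .groupBy(identity)
      .map { case (gram, hits) => (gram, hits.size) }
      .toSeq
      .sortBy { case (gram, count) => (-count, gram.mkString(" ")) }
      .take(k)

  // Like context_ngrams() with a (contextWord, NULL) context:
  // the k most frequent words that directly follow contextWord.
  def topAfter(texts: Seq[String], contextWord: String, k: Int): Seq[(String, Int)] =
    texts
      .flatMap { t =>
        tokenize(t).sliding(2).collect {
          case Seq(a, b) if a == contextWord.toLowerCase => b
        }
      }
      .groupBy(identity)
      .map { case (word, hits) => (word, hits.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) }
      .take(k)

  def main(args: Array[String]): Unit = {
    val tweets = Seq("I love Hadoop", "Hadoop is great", "I love Hive and I love Spark")
    println(topNgrams(tweets, 1, 2))     // two most frequent single words
    println(topAfter(tweets, "love", 3)) // words most often following "love"
  }
}
```

In your case the context word would be the subject you searched for, and you would probably also want to drop stop words before counting.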
You could also refer to the section on "Analyze Tweet data in Hive" in this Hortonworks Tutorial and modify the queries to suit your requirements.
You can create a HiveContext instance in Spark (using Scala) like this:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)  // sc is your existing SparkContext instance
Then define your Hive query as:
You could refer to this tutorial to see the above use of HiveContext.
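As a rough sketch of how these two steps could fit together, assuming a Hive table named tweets with a string column named text (both names are hypothetical, so substitute your own), something along these lines should run the ngrams() query from Spark. I have not run this against your setup, and whether Hive functions such as ngrams() resolve can depend on your Spark build having Hive support. Note also that HiveContext is deprecated from Spark 2.0 onwards in favour of SparkSession with enableHiveSupport().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: the table `tweets` and column `text` are assumed names.
object TweetWords {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("TweetWords"))
    val hiveContext = new HiveContext(sc)

    // sentences() tokenizes each tweet; ngrams(..., 1, 20) asks for the
    // 20 most frequent single words together with estimated frequencies.
    val topWords = hiveContext.sql(
      "SELECT ngrams(sentences(lower(text)), 1, 20) AS top_words FROM tweets")

    // The result is a single row holding an array of (ngram, estfrequency) structs.
    topWords.collect().foreach(println)

    sc.stop()
  }
}
```

In a spark-shell session you already have sc, so only the HiveContext and sql lines are needed there.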