Created 02-04-2016 08:46 PM
In a dataset (approx. 2 lakh records, i.e. about 200,000), there is a column named tags: a comma-separated list of the tags associated with each question. Examples of tags are "html", "error", and so on. Sample rows:
php,error,gd,image-processing
php,error,gd,image-processing
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
cocoa-touch,objective-c,design-patterns
cocoa-touch,objective-c,design-patterns
cocoa-touch,objective-c,design-patterns
core-animation
django,django-models
django,django-models
asp.net
scala,pattern-matching,oop,object-oriented-design,design-principles
scala,pattern-matching,oop,object-oriented-design,design-principles
scala,pattern-matching,oop,object-oriented-design,design-principles
. . . . .
How can I find the top 10 most commonly used tags in the dataset, in Pig or Hive?
Created 02-04-2016 09:51 PM
Here is a Pig word count with comments. Pass the delimiter to TOKENIZE, in your case a comma: TOKENIZE(line, ','). You might have to choose a different filter based on your input; you can start by commenting the filter out and adding it back later if needed. Finally, to extract only the top 10 entries, use LIMIT: top10 = LIMIT ordered_word_count 10; Be sure to inspect the stored file and make sure the words (tags) have been tokenized properly. If not, add a filter as mentioned above.
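For reference, a minimal sketch of what such a script might look like. It assumes the tags column has been exported to a text file with one record per line; the path 'tags.txt', the relation names, and the filter regex are all placeholders for illustration, not taken from the original tutorial.

```pig
-- Assumption: 'tags.txt' is a hypothetical path holding one comma-separated tag list per line.
lines = LOAD 'tags.txt' USING TextLoader() AS (line:chararray);

-- Split each line on commas and flatten the resulting bag into one tag per row.
tags = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ',')) AS tag;

-- Illustrative filter: keep tokens made of letters, digits, and . # + - only.
-- Adjust or comment out (and group 'tags' instead) depending on your data.
filtered_tags = FILTER tags BY tag MATCHES '[\\w.#+-]+';

-- Count how many times each tag occurs.
tag_groups = GROUP filtered_tags BY tag;
tag_count = FOREACH tag_groups GENERATE group AS tag, COUNT(filtered_tags) AS cnt;

-- Sort by frequency, highest first, and keep the first 10 rows.
ordered_word_count = ORDER tag_count BY cnt DESC;
top10 = LIMIT ordered_word_count 10;

DUMP top10;
```

Replace DUMP with STORE top10 INTO '<output path>' if you want to write the result out and inspect it afterwards. Note that a filter like this one drops tokens containing characters outside the listed set, so check the output against a few known tags.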
Created 02-04-2016 08:50 PM
@priyanka vijayakumar, good word count tutorial link. It uses Pig, HCatalog, and Hive; you will be better off with a combination of these.
Created 02-05-2016 11:44 PM
thanks a lot.