Support Questions

Find answers, ask questions, and share your expertise

top function in pig/hive

Contributor

In a dataset (approx. 2 lakh, i.e. 200,000, records), there is a column named tags: a comma-separated list of the tags associated with each question. Examples of tags are "html", "error", and so on.

php,error,gd,image-processing

php,error,gd,image-processing

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

lisp,scheme,subjective,clojure

cocoa-touch,objective-c,design-patterns

cocoa-touch,objective-c,design-patterns

cocoa-touch,objective-c,design-patterns

core-animation

django,django-models

django,django-models

asp.net

scala,pattern-matching,oop,object-oriented-design,design-principles

scala,pattern-matching,oop,object-oriented-design,design-principles

scala,pattern-matching,oop,object-oriented-design,design-principles

. . . . .

How can I find the top 10 most commonly used tags in the dataset, using Pig or Hive?

1 ACCEPTED SOLUTION

Master Guru

Here is a Pig word count with comments. Pass your delimiter to TOKENIZE, in your case a comma: TOKENIZE(line, ','). You might have to choose a different filter based on your input; you can start by commenting the filter out and adding it back later if needed. Finally, to extract only the top 10 entries, use LIMIT on the relation sorted by count in descending order: top10 = LIMIT ordered_word_count 10; (Pig's LIMIT takes no comma before the number). Be sure to inspect the stored file and make sure the words (tags) have been tokenized properly. If not, add a filter as mentioned above.
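The steps described above could be put together roughly as follows. This is a minimal sketch, not the linked script: it assumes the input file holds only the tags column, one record per line, and the path 'tags.txt', the output directory 'top10_tags', and all relation names are hypothetical. If your dataset has other columns, project the tags column first.

```pig
-- Load each record as one line of text ('tags.txt' is a hypothetical path).
lines = LOAD 'tags.txt' USING TextLoader() AS (line:chararray);

-- Split each line on commas and flatten the bag into one tag per row.
tags = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ',')) AS tag;

-- Optional filter: drop null or blank tokens. Adjust or comment out
-- depending on what your input actually contains.
clean_tags = FILTER tags BY tag IS NOT NULL AND TRIM(tag) != '';

-- Count the occurrences of each tag.
grouped = GROUP clean_tags BY tag;
tag_count = FOREACH grouped GENERATE group AS tag, COUNT(clean_tags) AS cnt;

-- Sort by count, highest first, and keep the first 10 rows.
ordered_tag_count = ORDER tag_count BY cnt DESC;
top10 = LIMIT ordered_tag_count 10;

STORE top10 INTO 'top10_tags';
```

While developing, replacing the STORE with DUMP top10; prints the result to the console, which makes it easy to check that the tags were split as expected.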


4 REPLIES

Master Mentor

@priyanka vijayakumar, here is a good word count tutorial link. It uses Pig, HCatalog and Hive; you will be better off with a combination of these.

Contributor

thanks a lot.

Contributor

thanks a lot.