top function in pig/hive
Labels: Apache Hive, Apache Pig
Created ‎02-04-2016 08:46 PM
In a dataset (approx. 2 lakh records), there is a column named tags, which holds a comma-separated list of the tags associated with each question. Examples of tags are "html", "error", and so on.
php,error,gd,image-processing
php,error,gd,image-processing
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
cocoa-touch,objective-c,design-patterns
cocoa-touch,objective-c,design-patterns
cocoa-touch,objective-c,design-patterns
core-animation
django,django-models
django,django-models
asp.net
scala,pattern-matching,oop,object-oriented-design,design-principles
scala,pattern-matching,oop,object-oriented-design,design-principles
scala,pattern-matching,oop,object-oriented-design,design-principles
. . . . .
How do I find the top 10 most commonly used tags in the dataset, in Pig or Hive?
Created ‎02-04-2016 09:51 PM
Here is a Pig word count with comments. Pass the delimiter to TOKENIZE, in your case a comma: TOKENIZE(line, ','). You might have to choose a different filter based on your input; you can start by commenting the filter out and adding it back later if needed. Finally, to keep only the top 10 entries, use LIMIT (note that Pig's LIMIT takes no comma before the count): top10 = LIMIT ordered_word_count 10. Be sure to inspect the stored file and make sure the words (tags) have been tokenized properly. If not, add the filter mentioned above.
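For reference, a minimal sketch of the kind of script described in this answer might look like the following. It is not the original script from the post. The input path 'tags.txt', the output path 'top10_tags', and every relation name other than ordered_word_count and top10 are assumptions made for illustration, and it assumes the file contains only the tags column, one record per line.

```pig
-- Load every line as a single chararray field.
-- 'tags.txt' is a hypothetical path; point it at your own tags data.
lines = LOAD 'tags.txt' USING TextLoader() AS (line:chararray);

-- Split each line on commas and flatten the resulting bag,
-- so that every tag ends up on its own row.
tags = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ',')) AS tag;

-- Optional filter: drop null or blank tokens.
-- Comment this out first and add it back only if the output needs it.
clean_tags = FILTER tags BY tag IS NOT NULL AND TRIM(tag) != '';

-- Group identical tags together and count the rows in each group.
grouped = GROUP clean_tags BY tag;
word_count = FOREACH grouped GENERATE group AS tag, COUNT(clean_tags) AS cnt;

-- Sort by count, highest first, and keep the first 10 rows.
ordered_word_count = ORDER word_count BY cnt DESC;
top10 = LIMIT ordered_word_count 10;

-- Write the result out; 'top10_tags' is a hypothetical output directory.
STORE top10 INTO 'top10_tags' USING PigStorage('\t');
```

The filter here only removes empty tokens. A stricter pattern such as MATCHES '\\w+' would also drop hyphenated tags like image-processing, which is one reason to check the stored output before tightening it.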
Created ‎02-04-2016 08:50 PM
@priyanka vijayakumar here is a good word count tutorial link. It uses Pig, HCatalog and Hive; you will be better off with a combination of these.
Created ‎02-05-2016 11:44 PM
thanks a lot.
