- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Pig script/command to filter multiple files on particular STRING
- Labels:
-
Apache Hadoop
-
Apache Pig
Created ‎09-01-2016 08:24 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am trying to write Hadoop Pig script which will take 2 files and filter based on string i.e
words.txt
google
facebook
twitter
linkedin
tweets.json
{"created_time":"18:47:31 ","text":"RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616,"created_date":"Sun Sep 30 2012"}
SCRIPT without using words.txt file
twitter = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;final= GROUP extracted BY pattern;dumpfinal;
OUTPUT
(facebook,{(facebook,252545104890449921,291041644,23:06:59,SunSep302012,RT @Joey7Barton:..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters.#fami ...)})
the output that im getting is, without loading the words.txt file i.e by filtering the tweet directly.
I need to get the output as
(facebook)(complete tweet of that facebook word contained)
i.e it should read the words.txt and as words are reading according to that it should get all the tweets from tweets.json file
Any help
Mohan.V
Created ‎09-01-2016 12:31 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this
twitter = LOAD 'Twitter.json' .... -- Like in your post words = LOAD '/user/john/words' as word:chararray; c = CROSS words, twitter; res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));
And finally dump or store "res" somewhere.
Created ‎09-01-2016 12:31 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this
twitter = LOAD 'Twitter.json' .... -- Like in your post words = LOAD '/user/john/words' as word:chararray; c = CROSS words, twitter; res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));
And finally dump or store "res" somewhere.
Created ‎09-02-2016 05:18 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Predrag Minovic
thanks for your replay.
i have tried the above but got an error.
Invalid field projection. Projected field [twitter::text] does not exist.
I know it is a small thing but I am very new to PIG.
So please suggest to solve this error.
Created ‎09-02-2016 05:27 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Can you insert "describe twitter; describe c;" after the CROSS statement, and find the output. If you loaded "twitter" like in your post, twitter::text should be there...
Created ‎09-02-2016 06:09 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
hey Predrag Minovic.
Its my mistake and i corrected it.
It worked.
OutPut
(buddy,05:29:31 ,RT @ns0lar1: "Yo, buddy my dad knows I smoke lighters." @wadegreen35 #Brotherhood,635838152,252278984384077825,Sun Sep 30 2012)
(facebook,16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012)
(facebook,21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)
(google,11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012)
(google,23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)
but here i would like to get the output as
example:-
(facebook,{(16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012),(21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)}
(google,{(11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012),(23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)}
i.e I need to bag the all tweets of facebook together likewise.
How can i get this.
Please suggest me Predrag Minovic.
Thanks in Advance.
Mohan.V
Created ‎09-02-2016 06:39 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Okay, after "res" insert this:
res1 = foreach (group res BY word) { tweets = foreach res generate id, user_id, created_time, created_date, text; generate group as pattern, tweets; }
The inner foreach is to get rid of the "word" associated which each output recored. Try "res2 = group res BY word" to see the difference. And please accept & up-vote the answer.
Created ‎09-02-2016 07:16 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes It worked
Thank you very much.
