Created 09-01-2016 08:24 AM
I am trying to write Hadoop Pig script which will take 2 files and filter based on string i.e
words.txt
google
facebook
twitter
linkedin
tweets.json
{"created_time":"18:47:31 ","text":"RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616,"created_date":"Sun Sep 30 2012"}
SCRIPT without using words.txt file
twitter = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;final= GROUP extracted BY pattern;dumpfinal;
OUTPUT
(facebook,{(facebook,252545104890449921,291041644,23:06:59,SunSep302012,RT @Joey7Barton:..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters.#fami ...)})
the output that im getting is, without loading the words.txt file i.e by filtering the tweet directly.
I need to get the output as
(facebook)(complete tweet of that facebook word contained)
i.e it should read the words.txt and as words are reading according to that it should get all the tweets from tweets.json file
Any help
Mohan.V
Created 09-01-2016 12:31 PM
Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this
twitter = LOAD 'Twitter.json' .... -- Like in your post words = LOAD '/user/john/words' as word:chararray; c = CROSS words, twitter; res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));
And finally dump or store "res" somewhere.
Created 09-01-2016 12:31 PM
Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this
twitter = LOAD 'Twitter.json' .... -- Like in your post words = LOAD '/user/john/words' as word:chararray; c = CROSS words, twitter; res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));
And finally dump or store "res" somewhere.
Created 09-02-2016 05:18 AM
Hi Predrag Minovic
thanks for your replay.
i have tried the above but got an error.
Invalid field projection. Projected field [twitter::text] does not exist.
I know it is a small thing but I am very new to PIG.
So please suggest to solve this error.
Created 09-02-2016 05:27 AM
Can you insert "describe twitter; describe c;" after the CROSS statement, and find the output. If you loaded "twitter" like in your post, twitter::text should be there...
Created 09-02-2016 06:09 AM
hey Predrag Minovic.
Its my mistake and i corrected it.
It worked.
OutPut
(buddy,05:29:31 ,RT @ns0lar1: "Yo, buddy my dad knows I smoke lighters." @wadegreen35 #Brotherhood,635838152,252278984384077825,Sun Sep 30 2012)
(facebook,16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012)
(facebook,21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)
(google,11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012)
(google,23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)
but here i would like to get the output as
example:-
(facebook,{(16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012),(21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)}
(google,{(11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012),(23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)}
i.e I need to bag the all tweets of facebook together likewise.
How can i get this.
Please suggest me Predrag Minovic.
Thanks in Advance.
Mohan.V
Created 09-02-2016 06:39 AM
Okay, after "res" insert this:
res1 = foreach (group res BY word) { tweets = foreach res generate id, user_id, created_time, created_date, text; generate group as pattern, tweets; }
The inner foreach is to get rid of the "word" associated which each output recored. Try "res2 = group res BY word" to see the difference. And please accept & up-vote the answer.
Created 09-02-2016 07:16 AM
Yes It worked
Thank you very much.