Support Questions
Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Innovation Accelerator group hub.

Pig script/command to filter multiple files on particular STRING

Expert Contributor

I am trying to write Hadoop Pig script which will take 2 files and filter based on string i.e

words.txt

google 
facebook 
twitter 
linkedin

tweets.json

{"created_time":"18:47:31 ","text":"RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616,"created_date":"Sun Sep 30 2012"}

SCRIPT without using words.txt file

twitter  = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
    filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
    extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;final= GROUP extracted BY pattern;dumpfinal;

OUTPUT

(facebook,{(facebook,252545104890449921,291041644,23:06:59,SunSep302012,RT @Joey7Barton:..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters.#fami ...)})

the output that im getting is, without loading the words.txt file i.e by filtering the tweet directly.

I need to get the output as

(facebook)(complete tweet of that facebook word contained)

i.e it should read the words.txt and as words are reading according to that it should get all the tweets from tweets.json file

Any help

Mohan.V

1 ACCEPTED SOLUTION

Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this

twitter  = LOAD 'Twitter.json' .... -- Like in your post
words = LOAD '/user/john/words' as word:chararray;
c = CROSS words, twitter;
res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));

And finally dump or store "res" somewhere.

View solution in original post

6 REPLIES 6

Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this

twitter  = LOAD 'Twitter.json' .... -- Like in your post
words = LOAD '/user/john/words' as word:chararray;
c = CROSS words, twitter;
res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));

And finally dump or store "res" somewhere.

Expert Contributor

Hi Predrag Minovic

thanks for your replay.

i have tried the above but got an error.

Invalid field projection. Projected field [twitter::text] does not exist.

I know it is a small thing but I am very new to PIG.

So please suggest to solve this error.

Can you insert "describe twitter; describe c;" after the CROSS statement, and find the output. If you loaded "twitter" like in your post, twitter::text should be there...

Expert Contributor

hey Predrag Minovic.

Its my mistake and i corrected it.

It worked.

OutPut

(buddy,05:29:31 ,RT @ns0lar1: "Yo, buddy my dad knows I smoke lighters." @wadegreen35 #Brotherhood,635838152,252278984384077825,Sun Sep 30 2012)

(facebook,16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012)

(facebook,21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)

(google,11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012)

(google,23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)

but here i would like to get the output as

example:-

(facebook,{(16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012),(21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)}

(google,{(11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012),(23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)}

i.e I need to bag the all tweets of facebook together likewise.

How can i get this.

Please suggest me Predrag Minovic.

Thanks in Advance.

Mohan.V

Okay, after "res" insert this:

res1 = foreach (group res BY word) {
     tweets = foreach res generate id, user_id, created_time, created_date, text;
     generate group as pattern, tweets;
}

The inner foreach is to get rid of the "word" associated which each output recored. Try "res2 = group res BY word" to see the difference. And please accept & up-vote the answer.

Expert Contributor

Yes It worked

Thank you very much.