Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Pig script/command to filter multiple files on particular STRING

Solved Go to solution

Pig script/command to filter multiple files on particular STRING

Expert Contributor

I am trying to write Hadoop Pig script which will take 2 files and filter based on string i.e

words.txt

google 
facebook 
twitter 
linkedin

tweets.json

{"created_time":"18:47:31 ","text":"RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616,"created_date":"Sun Sep 30 2012"}

SCRIPT without using words.txt file

twitter  = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
    filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
    extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;final= GROUP extracted BY pattern;dumpfinal;

OUTPUT

(facebook,{(facebook,252545104890449921,291041644,23:06:59,SunSep302012,RT @Joey7Barton:..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters.#fami ...)})

the output that im getting is, without loading the words.txt file i.e by filtering the tweet directly.

I need to get the output as

(facebook)(complete tweet of that facebook word contained)

i.e it should read the words.txt and as words are reading according to that it should get all the tweets from tweets.json file

Any help

Mohan.V

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Pig script/command to filter multiple files on particular STRING

Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this

twitter  = LOAD 'Twitter.json' .... -- Like in your post
words = LOAD '/user/john/words' as word:chararray;
c = CROSS words, twitter;
res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));

And finally dump or store "res" somewhere.

6 REPLIES 6

Re: Pig script/command to filter multiple files on particular STRING

Okay this may not be optimal but it should work: upload words.txt to a certain directory on hdfs and do this

twitter  = LOAD 'Twitter.json' .... -- Like in your post
words = LOAD '/user/john/words' as word:chararray;
c = CROSS words, twitter;
res = FILTER c BY (twitter::text MATCHES CONCAT(CONCAT('.*',words::word),'.*'));

And finally dump or store "res" somewhere.

Re: Pig script/command to filter multiple files on particular STRING

Expert Contributor

Hi Predrag Minovic

thanks for your replay.

i have tried the above but got an error.

Invalid field projection. Projected field [twitter::text] does not exist.

I know it is a small thing but I am very new to PIG.

So please suggest to solve this error.

Re: Pig script/command to filter multiple files on particular STRING

Can you insert "describe twitter; describe c;" after the CROSS statement, and find the output. If you loaded "twitter" like in your post, twitter::text should be there...

Re: Pig script/command to filter multiple files on particular STRING

Expert Contributor

hey Predrag Minovic.

Its my mistake and i corrected it.

It worked.

OutPut

(buddy,05:29:31 ,RT @ns0lar1: "Yo, buddy my dad knows I smoke lighters." @wadegreen35 #Brotherhood,635838152,252278984384077825,Sun Sep 30 2012)

(facebook,16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012)

(facebook,21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)

(google,11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012)

(google,23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)

but here i would like to get the output as

example:-

(facebook,{(16:10:24 ,RT @KeydetInFocus: Nobody would know that Mitt Romney is at VMI today......... holy tweets and facebook statuses,286719616,255339370154975232,Mon Oct 08 2012),(21:25:39 ,RT @cineworld: facebook Tough on Cineworld, tough on the causes of Cineworld. RT&Vote for me to win The Campaign merch http://t.co/qItR8e2C O ...,328175259,252519600917458944,Sun Sep 30 2012)}

(google,{(11:33:16 ,@MaarionYmcmb google mere ta dit tu va resté chez toi dnc tu restes !,845912316,252370526411051008,Sun Sep 30 2012),(23:06:59 ,RT @Nevada: Obama Arrives in google for Debate Preparation. http://t.co/qItR8e2C Sep 30 2012)}

i.e I need to bag the all tweets of facebook together likewise.

How can i get this.

Please suggest me Predrag Minovic.

Thanks in Advance.

Mohan.V

Re: Pig script/command to filter multiple files on particular STRING

Okay, after "res" insert this:

res1 = foreach (group res BY word) {
     tweets = foreach res generate id, user_id, created_time, created_date, text;
     generate group as pattern, tweets;
}

The inner foreach is to get rid of the "word" associated which each output recored. Try "res2 = group res BY word" to see the difference. And please accept & up-vote the answer.

Re: Pig script/command to filter multiple files on particular STRING

Expert Contributor

Yes It worked

Thank you very much.

Don't have an account?
Coming from Hortonworks? Activate your account here