Pig Script Error
Labels: Apache Hadoop, Apache HBase, Apache Pig
Created ‎09-10-2016 08:12 AM
I am new to Pig. I am trying to filter a text file and store the result in HBase.
Here is the sample input file, sample.txt:
{"pattern":"google_1473491793_265244074740","tweets":[{"tweet::created_at":"18:47:31 ","tweet::id":"252479809098223616","tweet::user_id":"450990391","tweet::text":"rt @joey7barton: ..give a google about whether the americans wins a ryder cup. i mean surely he has slightly more important matters. #fami ..."}]}
{"pattern":"facebook_1473491793_265244074740","tweets":[{"tweet::created_at":"11:33:16 ","tweet::id":"252370526411051008","tweet::user_id":"845912316","tweet::text":"@maarionymcmb facebook mere ta dit tu va resté chez toi dnc tu restes !"}]}
Script:
data = load 'sample.txt' using JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray, tweet::id: chararray, tweet::user_id: chararray, tweet::text: chararray)}');
A = FILTER data BY pattern == 'google_*';
grouped = foreach (group A by pattern) {
    tweets1 = foreach data generate tweets.(created_at), tweets.(id), tweets.(user_id), tweets.(text);
    generate group as pattern1, tweets1;
};
But I got this error when I ran the grouped statement:
2016-09-10 13:38:52,995 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 41, column 57> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)
Please tell me what I am doing wrong.
Thank you.
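For reference, the parse error most likely comes from the nested FOREACH: it projects from the outer relation data instead of the grouped relation A, and inside a nested block you can only project from the relation being grouped. Note also that == in Pig is an exact string comparison, so 'google_*' is not treated as a wildcard; the regex operator matches is likely what was intended. A hedged sketch of a version that should parse (relation and field names are taken from the script above, not verified against a cluster):

```pig
data = LOAD 'sample.txt' USING JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray, tweet::id: chararray, tweet::user_id: chararray, tweet::text: chararray)}');
-- 'matches' does regex matching; '==' would only match the literal string 'google_*'
A = FILTER data BY pattern matches 'google_.*';
-- project the tweets bag from A (the grouped relation), not from data
grouped = FOREACH (GROUP A BY pattern) GENERATE group AS pattern1, A.tweets AS tweets1;
```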
Created ‎09-11-2016 07:04 AM
I think I got it on my own.
As gkeys said, I made it too complex.
I realized that I don't need the third step (the grouping), and the data is now successfully stored into HBase.
Here is the script:
data = load 'sample.txt' using JsonLoader('pattern:chararray, tweets: bag {(tweet::created_at: chararray, tweet::id: chararray, tweet::user_id: chararray, tweet::text: chararray)}');
A = FILTER data BY pattern == 'google_*';
STORE A INTO 'hbase://tablename' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('tweets:tweets');
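For context, HBaseStorage treats the first field of each tuple (here, pattern) as the HBase row key, and maps the remaining fields to the columns named in its constructor argument; the table and its column family must already exist before the STORE runs. Assuming the table name tablename and column family tweets from the script above, creating and inspecting the table from the hbase shell could look like this (a sketch, not a verified run):

```
# in the hbase shell: create the target table with the 'tweets' column family
create 'tablename', 'tweets'
# after the Pig job finishes, inspect what was written
scan 'tablename'
```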
Created ‎09-10-2016 12:31 PM
There is a lot going on here -- when writing a complex script like this, the following approach is useful for building and debugging:
- Run locally against a small subset of records (pig -x local -f <scriptOnLocalFileSystem>.pig). This makes each iteration of the script run faster.
- Build the script statement by statement until you reach the failing one (run the first statement, add the second and run, and so on until it fails). When it fails, focus on the last statement you added and fix it.
- These steps are good for finding grammar issues (which it looks like you have, based on the error message). If you also want to make sure your data is being processed correctly, put a DUMP statement after each line during each iteration, so you can inspect the result of each statement.
- If you are using inline (nested) statements, like your grouped = statement, separate them out into standalone statements until the script works. This makes the issue easier to isolate.
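The steps above can be sketched as follows (the file name and schema are taken from the original script; this only illustrates the build-one-statement-at-a-time approach, run with pig -x local -f debug.pig):

```pig
-- iteration 1: load only, then inspect the parsed records
data = LOAD 'sample.txt' USING JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray, tweet::id: chararray, tweet::user_id: chararray, tweet::text: chararray)}');
DUMP data;

-- iteration 2: add the filter and inspect again
A = FILTER data BY pattern == 'google_*';
DUMP A;
-- only once each step looks right, add the next statement
```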
Let me know how that goes.
Created ‎09-10-2016 02:24 PM
Thank you for your valuable suggestions, gkeys.
I didn't expect that it would become such a complex script.
As I said, I am just a beginner in Pig.
So please suggest a solution for the same.
Created ‎09-11-2016 12:48 PM
Very glad to see you solved it yourself by debugging -- it is the best way to learn and improve your skills 🙂
