Created 08-30-2016 12:17 PM
Hi,
I am trying to process a sample tweet file and extract the complete tweets whose text contains a particular word.
I used the following script:
twitter = LOAD 'sample.json' USING JsonLoader('coordinates:map[], created_at:chararray, entities:map[], favorited:chararray,id:int,favorite_count:int, id_str:chararray,metadata:map[], in_reply_to_screen_name:chararray, in_reply_to_status_id_str:chararray, place:map[], possibly_sensitive:chararray, retweet_count:int, source:chararray, text:chararray, truncated:chararray, user:map[], withheld_in_countries:{t:(country:chararray)}');
filtered = FILTER twitter BY (text MATCHES '.*word.*');
extracted = FOREACH filtered GENERATE text, id;
dump extracted;
When I ran the script, the job finished and reported success at the end.
But there was no output, and I also found the following warnings in the log:
2016-08-30 17:40:22,054 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad record, could not find start of record
2016-08-30 17:40:22,054 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad map field, could not find start of object, field 2
2016-08-30 17:40:22,054 [LocalJobRunner Map Task Executor #0] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad record, returning null for {"cordinates":{"type":"Point","coordinates":["-82.695728","38.502019"]},"created_at":"Wed May 29 15:47:17 +0000 2013","current_user_retweet":null,"entities":{"hashtags":[{"indices":["64","73"],"text":"palecity"}],"symbols":[],"urls":[{"expanded_url":"http://path.com/p/2OpKGV","indices":["103","125"],"display_url":"path.com/p/2OpKGV","url":"http://t.co/s4X71J1xEv"}],"user_mentions":[]},"favorited":"false","id_str":"339769924257988608","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"place":{"id":"14bdb1a6511724ec","place_type":"city","bounding_box":{"type":"Polygon","coordinates":[[["-82.735154","38.485755"],["-82.735154","38.545196"],["-82.674361","38.545196"],["-82.674361","38.485755"]]]},"name":"Russell","attributes":{},"country_code":"US","url":"http://api.twitter.com/1/geo/id/14bdb1a6511724ec.json","full_name":"Russell, KY","country":"United States"},"possibly_sensitive":"false","retweet_count":0,"source":"<a href=\"https://path.com/\" rel=\"nofollow\">Path</a>","text":"Getting ready word to lay out poolside for the first time this year! 
#palecity (at Jeff's Big Deck-South) — http://t.co/s4X71J1xEv","truncated":"false","user":{"location":"AshRussFonte, KY","default_profile":"false","profile_background_tile":"false","statuses_count":"375","lang":"en","profile_link_color":"0084B4","profile_banner_url":"https://pbs.twimg.com/profile_banners/32952244/1348409088","id":"32952244","following":null,"protected":"false","favourites_count":"60","profile_text_color":"333333","contributors_enabled":"false","verified":"false","description":"Wife, daughter, SLP and celebrity gossip enthusiast.","name":"Beth ","profile_sidebar_border_color":"C0DEED","profile_background_color":"C0DEED","created_at":"Sat Apr 18 17:36:45 +0000 2009","default_profile_image":"false","followers_count":"19","geo_enabled":"true","profile_image_url_https":"https://si0.twimg.com/profile_images/3624785207/ede699f51f98b4da3ee700da3a7ed973_normal.jpeg","profile_background_image_url":"http://a0.twimg.com/profile_background_images/75611058/599663425_xB7Dp-M.jpg","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/75611058/599663425_xB7Dp-M.jpg","follow_request_sent":null,"url":null,"utc_offset":"-18000","time_zone":"Eastern Time (US & Canada)","notifications":null,"friends_count":"293","profile_use_background_image":"true","profile_sidebar_fill_color":"DDEEF6","screen_name":"BBS610","id_str":"32952244","profile_image_url":"http://a0.twimg.com/profile_images/3624785207/ede699f51f98b4da3ee700da3a7ed973_normal.jpeg","is_translator":"false","listed_count":"0"},"withheld_copyright":null,"withheld_in_countries":null,"withheld_scope":null}
I think the JSON file that I loaded is not well formed. Please suggest how to fix this.
Thank you.
Mohan.V
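One way to test that suspicion outside Pig is sketched below. It assumes the loader treats each input line as one record, so every non-blank line of the file should parse as a complete JSON object on its own. The helper name find_bad_lines is hypothetical and is not part of Pig.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, reason) for each line that is not a standalone JSON object."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are ignored
        try:
            value = json.loads(stripped)
        except json.JSONDecodeError as err:
            bad.append((number, f"not valid JSON: {err.msg}"))
            continue
        if not isinstance(value, dict):
            bad.append((number, "valid JSON but not an object"))
    return bad


# Illustrative input: the second record is split across two lines,
# similar to the tweet text in the warning above.
sample_lines = ['{"text": "a word"}', '{"text": "split', 'here"}']
for number, reason in find_bad_lines(sample_lines):
    print(f"line {number}: {reason}")
```

The function only iterates over lines, so an open file handle for sample.json can be passed in place of the list.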
Created 08-31-2016 08:49 AM
The built-in JsonLoader has somewhat limited functionality: it expects every entry (tweet) to contain its elements in the same order as the Pig schema. So, first make sure this condition is satisfied. For example, your schema has "id:int", but the record printed in the warnings does not have an integer element at that position. Element names are not preserved either. Pig takes the elements one by one in the order they appear in the input, so you could just as well name them a, b, c, ...
You may also wish to try the Elephant Bird JsonLoader, which has more advanced features.
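The ordering condition can be checked outside Pig with a rough sketch like the one below. It walks the values of one record in input order and compares each JSON type with the Pig type declared at the same position. The function name, the ACCEPTED table, and the shortened record are assumptions made for illustration; this approximates the positional behaviour described above and does not reimplement JsonLoader.

```python
import json

# Simplified assumption: which Python/JSON types satisfy each Pig type.
ACCEPTED = {
    "map": (dict,),
    "chararray": (str,),
    "int": (int,),
}


def first_type_mismatch(record_line, schema):
    """Return (position, schema_name, found_type) for the first positional mismatch, else None."""
    # json.loads keeps the key order of the input text, so values() is in input order.
    values = list(json.loads(record_line).values())
    for position, (name, pig_type) in enumerate(schema):
        if position >= len(values):
            return position, name, "missing"
        value = values[position]
        # bool is a subclass of int in Python, so exclude it explicitly.
        matches = isinstance(value, ACCEPTED[pig_type]) and not isinstance(value, bool)
        if not matches:
            return position, name, type(value).__name__
    return None


# First five fields of the schema from the question's LOAD statement.
schema = [
    ("coordinates", "map"),
    ("created_at", "chararray"),
    ("entities", "map"),
    ("favorited", "chararray"),
    ("id", "int"),
]

# Leading fields of the record from the warning, with nested values shortened.
record = (
    '{"cordinates": {"type": "Point"}, '
    '"created_at": "Wed May 29 15:47:17 +0000 2013", '
    '"current_user_retweet": null, "entities": {}, "favorited": "false"}'
)

print(first_type_mismatch(record, schema))  # prints (2, 'entities', 'NoneType')
```

In this shortened record the value at position 2 is the null from current_user_retweet, where the schema expects the entities map. That lines up with "field 2" in the warning, although the sketch only approximates what the loader does.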
Created 09-09-2016 03:15 AM
Thanks for your reply, Predrag Minovic.
I tried the Elephant Bird JsonLoader.
script:
REGISTER piggybank.jar;
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER elephant-bird-core-4.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
json = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');
describe json;
The result is:
Schema for json unknown.
Please suggest what I should change.
Created 09-15-2016 06:39 AM
I figured it out on my own.
I think the problem was caused by the different versions of the Elephant Bird jars that I used in my script.
When I used the same version for all of the Elephant Bird jars, as suggested by @gkeys, it worked fine.
script:
-- All Elephant Bird jars use the same version (4.1).
REGISTER elephant-bird-core-4.1.jar;
REGISTER elephant-bird-hadoop-compat-4.1.jar;
REGISTER elephant-bird-pig-4.1.jar;
REGISTER json-simple-1.1.1.jar;

twitter = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader();

-- $0 is the map produced by the loader; # looks up a key in that map.
extracted = FOREACH twitter GENERATE
    (chararray)$0#'created_at' AS created_at,
    (chararray)$0#'id' AS id,
    (chararray)$0#'id_str' AS id_str,
    (chararray)$0#'text' AS text,
    (chararray)$0#'source' AS source,
    com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'entities') AS entities,
    (boolean)$0#'favorited' AS favorited,
    (long)$0#'favorite_count' AS favorite_count,
    (long)$0#'retweet_count' AS retweet_count,
    (boolean)$0#'retweeted' AS retweeted,
    com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'place') AS place;

dump extracted;
And it worked fine.
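A version mismatch like this one can be spotted mechanically before running the script. The sketch below is only an illustration: the function name is hypothetical, and the regular expression assumes jar names of the form elephant-bird-<module>-<version>.jar, as used in this thread.

```python
import re

# Assumed naming pattern: elephant-bird-<module>-<version>.jar
JAR_PATTERN = re.compile(r"elephant-bird-([a-z-]+)-(\d+(?:\.\d+)*)\.jar")


def elephant_bird_versions(script_text):
    """Map each version string to the Elephant Bird modules registered with it."""
    versions = {}
    for module, version in JAR_PATTERN.findall(script_text):
        versions.setdefault(version, []).append(module)
    return versions


# REGISTER lines from the earlier script that reported an unknown schema.
mixed = (
    "REGISTER elephant-bird-pig-4.3.jar REGISTER elephant-bird-core-4.1.jar "
    "REGISTER elephant-bird-hadoop-compat-4.3.jar"
)
# REGISTER lines from the script that worked.
uniform = (
    "REGISTER elephant-bird-core-4.1.jar REGISTER elephant-bird-hadoop-compat-4.1.jar "
    "REGISTER elephant-bird-pig-4.1.jar"
)

for label, text in (("mixed", mixed), ("uniform", uniform)):
    found = elephant_bird_versions(text)
    status = "consistent" if len(found) <= 1 else "MISMATCH"
    print(label, status, found)
```

More than one key in the returned dictionary means the script registers Elephant Bird jars from different releases.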