Created 11-15-2016 05:13 PM
I have a pig script where I want to split an alias depending on some conditions. The script is:
SPLIT i1 INTO top IF((user_followers_count > i3.avg_user_followers_count) AND (avl_user_engagements > i3.avg_avl_user_engagements)), bot IF((user_followers_count < i3.avg_user_followers_count) AND (avl_user_engagements < i3.avg_avl_user_engagements)), med OTHERWISE;
But looking at the output, I think the code is not working as expected.
Input(s): Successfully read 165 records (362590 bytes) from: "/tmp/user" Output(s): Successfully stored 0 records (815 bytes) in: "/tmp/inf_med" Successfully stored 1 records (163 bytes) in: "/tmp/inf_top" Successfully stored 113 records (18068 bytes) in: "/tmp/inf_bot"
There are about 50 records that have to go into the med category.
My pig version is: 0.15.0.2.4.2.0-258
HDP: 2.4.2.0-258
Created 11-15-2016 05:41 PM
Otherwise was introduced in pig 0.10 and is a very solid feature. There should not be an issue with it.
Could you:
You can add to these comments
Created 11-15-2016 05:41 PM
Otherwise was introduced in pig 0.10 and is a very solid feature. There should not be an issue with it.
Could you:
You can add to these comments
Created 11-15-2016 05:52 PM
1. My full script:
register /usr/hdp/current/pig-client/lib/piggybank.jar register /opt/elephantbird-jars/elephant-bird-core-4.5.jar register /opt/elephantbird-jars/elephant-bird-hadoop-compat-4.5.jar register /opt/elephantbird-jars/elephant-bird-pig-4.5.jar register /opt/elephantbird-jars/json-simple-1.1.1.jar data_input_2 = LOAD '/tmp/user' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []); i1 = FOREACH data_input_2 GENERATE json#'user_id' as user_id, (int)json#'user_followers_count' as user_followers_count, (int)json#'avl_user_total_retweets' as avl_user_total_retweets, (int)json#'avl_user_total_likes' as avl_user_total_likes, (int)json#'avl_user_total_replies' as avl_user_total_replies, (int)json#'avl_user_engagements' as avl_user_engagements; i2 = GROUP i1 ALL; i3 = FOREACH i2 GENERATE AVG(i1.user_followers_count) AS avg_user_followers_count, AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets, AVG(i1.avl_user_total_likes) AS avg_avl_user_total_likes, AVG(i1.avl_user_total_replies) AS avg_avl_user_total_replies, AVG(i1.avl_user_engagements) AS avg_avl_user_engagements; SPLIT i1 INTO top IF((user_followers_count > i3.avg_user_followers_count) AND (avl_user_engagements > i3.avg_avl_user_engagements) AND (avl_user_total_retweets > i3.avg_avl_user_total_retweets) AND (avl_user_total_likes > i3.avg_avl_user_total_likes) AND (avl_user_total_replies > i3.avg_avl_user_total_replies)), bot IF((user_followers_count < i3.avg_user_followers_count) AND (avl_user_engagements < i3.avg_avl_user_engagements) AND (avl_user_total_retweets < i3.avg_avl_user_total_retweets) AND (avl_user_total_likes < i3.avg_avl_user_total_likes) AND (avl_user_total_replies < i3.avg_avl_user_total_replies)), med OTHERWISE; STORE top INTO '/tmp/inf_top' USING JsonStorage(); STORE bot INTO '/tmp/inf_bot' USING JsonStorage(); STORE med INTO '/tmp/inf_med' USING JsonStorage();
2. sample 1 record from Input:
{"user_id":"777576939330699265","user_created_at":"2016-09-18T14:36:03.000-04:00","user_type":"corporate","tweet_created_at":"2016-10-13T22:18:47.000-04:00","avl_user_days_active":"25","user_notifications":"false","user_follow_request_sent":"false","user_following_count":"136","user_name":"VG RetroDeals Sega","user_time_zone":"Paris","user_profile_background_color":"000000","user_translation_enabled":"false","user_profile_link_color":"1B95E0","user_UTC_offset":"7200","user_profile_sidebar_border_color":"000000","user_has_extended_profile":"true","user_profile_background_tile":"false","user_profile_use_background_image":"false","user_default_profile_image":"false","user_description":"#Sega Videogames RetroDeals available on eBay: #retrogaming #Genesis #MegaDrive #SegaCD #MegaCD #GameGear #SegaSaturn #Dreamcast #MasterSystem #Mark3 RT","user_profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","user_profile_sidebar_fill_color":"000000","user_followers_count":"133","user_profile_image_url":"http://pbs.twimg.com/profile_images/777581486266576897/7WB52yjV_normal.jpg","user_geo_enabled":"false","user_entities_description_urls":"[]","user_screen_name":"VGDealsSega","user_total_liked":"43","user_url":"https://t.co/Px0pVPxM1E","user_total_posts":"2162","user_default_profile":"false","user_language":"fr","user_protected":"false","avl_user_brand_association":"{\"virtua\":2}","user_total_public_lists":"13","user_profile_image_url_https":"https://pbs.twimg.com/profile_images/777581486266576897/7WB52yjV_normal.jpg","user_contributors_enabled":"false","user_following":"false","user_verified":"false","avl_user_post_frequency":86.48,"avl_user_like_frequency":1.72,"avl_user_follower_following_ratio":0.9779411764705882,"user_is_translator":"false","user_location":"Planet RetroGaming","user_profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","user_profile_banner_url":"https://pbs.twimg.com/profile_banners/777576939330699265/1474224793","user_profile_text_color":"000000","avl_user_total_retweets":1,"avl_user_total_likes":1,"avl_user_total_replies":null,"avl_user_engagements":2,"user_reply_to_reply_count":null,"avl_user_reply_rate":0}
3. Verification that i1 has 165 records:
It reads 165 records from the input data file. Also, I just added the line "STORE i1 INTO '/tmp/inf' after generating i1 and the output is:
Input(s): Successfully read 165 records (362590 bytes) from: "/user/alenza/virtua/marketanalysis/data/results/analysis_name/tweets_combined/user" Output(s): Successfully stored 165 records (27243 bytes) in: "/tmp/inf" Successfully stored 0 records (815 bytes) in: "/tmp/inf_med" Successfully stored 0 records in: "/tmp/inf_top" Successfully stored 0 records in: "/tmp/inf_bot"
Created 11-17-2016 02:52 PM
@Sree Kupp I have tested this extensively and believe there is a bug in SPLIT OTHERWISE against pig EVAL functions (like avg). I will submit a JIRA and update with the number.
But there is a workaround shown below
-------------------------
Here is what I did to replicate the issue and identify a workaround:
dataset
777576939330699265,0,3 777576939330699261,1,3 777576939330699262,2,2 777576939330699263,3,1 777576939330699264,4,1
script 1 (replicates your issue)
data_input_2 = LOAD 'data_tweets' USING PigStorage(','); i1 = FOREACH data_input_2 GENERATE $0 as user_id, (int)$1 as user_followers_count, (int)$2 as avl_user_total_retweets; i2 = GROUP i1 ALL; i3 = FOREACH i2 GENERATE AVG(i1.user_followers_count) AS avg_user_followers_count, AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets; SPLIT i1 INTO top IF(user_followers_count > i3.avg_user_followers_count), bot IF(user_followers_count < i3.avg_user_followers_count), med OTHERWISE; STORE i1 INTO 'tmp/inf_1' USING JsonStorage(); STORE i2 INTO 'tmp/inf_2' USING JsonStorage(); STORE i3 INTO 'tmp/inf_3' USING JsonStorage(); STORE top INTO 'tmp/split_top' USING JsonStorage(); STORE bot INTO 'tmp/split_bot' USING JsonStorage(); STORE med INTO 'tmp/split_med' USING JsonStorage();
Results
tmp/inf_1 and tmp/inf_2: results as expected (5 records, correct grouping) tmp/inf_3: {"avg_user_followers_count":2.0,"avg_avl_user_total_retweets":2.0}
tmp/split_top {"user_id":"777576939330699263","user_followers_count":3,"avl_user_total_retweets":1} {"user_id":"777576939330699264","user_followers_count":4,"avl_user_total_retweets":1} tmp/split_med: no records tmp/split_bot {"user_id":"777576939330699265","user_followers_count":0,"avl_user_total_retweets":3} {"user_id":"777576939330699261","user_followers_count":1,"avl_user_total_retweets":3}
Here I took the average and hard-coded it into the split. This works (but is an inconvenient hack).
script 2 (hard-coding of average, which works)
Everything the same except I used the following for split, based on dump of i3
SPLIT i1 INTO top IF(user_followers_count > 2.0), bot IF(user_followers_count < 2.0), med OTHERWISE;
Results
Results were identical to first script, except:
tmp/split_med: {"user_id":"777576939330699262","user_followers_count":2,"avl_user_total_retweets":2}
Find the values in i3 and then hardcode them in the SPLIT statement.
-------------------------
If this answers your question (I hope so 🙂 ), please let me know by accepting the answer; else let me know if there are remaining gaps.
Created 11-17-2016 02:57 PM
Hi @Greg Keys
SPLIT OTHERWISE is working now in my code. Not sure how and why. I was sure it did not work when I first used it.
Thanks for all your time and effort.
Created 11-17-2016 03:01 PM
Good to hear. Would love to know why it is working (especially since my stripped down version replicated the issue).