Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Pig: OTHERWISE keyword in SPLIT not working.,Pig "Otherwise" does not work on Centos 7 Cluster.

avatar
Contributor
I have a pig script where I want to split an alias depending on some conditions. The script is:
SPLIT i1 INTO
   top IF((user_followers_count > i3.avg_user_followers_count) AND (avl_user_engagements > i3.avg_avl_user_engagements)),
   bot IF((user_followers_count < i3.avg_user_followers_count) AND (avl_user_engagements < i3.avg_avl_user_engagements)),
   med OTHERWISE;

But looking at the output, I think the code is not working as expected.

Input(s):
Successfully read 165 records (362590 bytes) from: "/tmp/user"

Output(s):
Successfully stored 0 records (815 bytes) in: "/tmp/inf_med"
Successfully stored 1 records (163 bytes) in: "/tmp/inf_top"
Successfully stored 113 records (18068 bytes) in: "/tmp/inf_bot"

There are about 50 records that have to go into the med category.

My pig version is: 0.15.0.2.4.2.0-258

HDP: 2.4.2.0-258

1 ACCEPTED SOLUTION

avatar
Guru

Otherwise was introduced in pig 0.10 and is a very solid feature. There should not be an issue with it.

Could you:

  • provide full script
  • provide sample data set
  • verify that i1 has 165 records

You can add to these comments

View solution in original post

5 REPLIES 5

avatar
Guru

Otherwise was introduced in pig 0.10 and is a very solid feature. There should not be an issue with it.

Could you:

  • provide full script
  • provide sample data set
  • verify that i1 has 165 records

You can add to these comments

avatar
Contributor

@Greg Keys

Thanks for your time.

1. My full script:

register /usr/hdp/current/pig-client/lib/piggybank.jar
register /opt/elephantbird-jars/elephant-bird-core-4.5.jar
register /opt/elephantbird-jars/elephant-bird-hadoop-compat-4.5.jar
register /opt/elephantbird-jars/elephant-bird-pig-4.5.jar
register /opt/elephantbird-jars/json-simple-1.1.1.jar

data_input_2 = LOAD '/tmp/user' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []); i1 = FOREACH data_input_2 GENERATE json#'user_id' as user_id, (int)json#'user_followers_count' as user_followers_count, (int)json#'avl_user_total_retweets' as avl_user_total_retweets, (int)json#'avl_user_total_likes' as avl_user_total_likes, (int)json#'avl_user_total_replies' as avl_user_total_replies, (int)json#'avl_user_engagements' as avl_user_engagements;

i2 = GROUP i1 ALL;
i3 = FOREACH i2 GENERATE AVG(i1.user_followers_count) AS avg_user_followers_count,  AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets, AVG(i1.avl_user_total_likes) AS avg_avl_user_total_likes, AVG(i1.avl_user_total_replies) AS avg_avl_user_total_replies, AVG(i1.avl_user_engagements) AS avg_avl_user_engagements;

SPLIT i1 INTO
   top IF((user_followers_count > i3.avg_user_followers_count) AND (avl_user_engagements > i3.avg_avl_user_engagements) AND (avl_user_total_retweets > i3.avg_avl_user_total_retweets) AND (avl_user_total_likes > i3.avg_avl_user_total_likes) AND (avl_user_total_replies > i3.avg_avl_user_total_replies)),
   bot IF((user_followers_count < i3.avg_user_followers_count) AND (avl_user_engagements < i3.avg_avl_user_engagements) AND (avl_user_total_retweets < i3.avg_avl_user_total_retweets) AND (avl_user_total_likes < i3.avg_avl_user_total_likes) AND (avl_user_total_replies < i3.avg_avl_user_total_replies)),
   med OTHERWISE;

STORE top INTO '/tmp/inf_top' USING JsonStorage();
STORE bot INTO '/tmp/inf_bot' USING JsonStorage();
STORE med INTO '/tmp/inf_med' USING JsonStorage();

2. sample 1 record from Input:

{"user_id":"777576939330699265","user_created_at":"2016-09-18T14:36:03.000-04:00","user_type":"corporate","tweet_created_at":"2016-10-13T22:18:47.000-04:00","avl_user_days_active":"25","user_notifications":"false","user_follow_request_sent":"false","user_following_count":"136","user_name":"VG RetroDeals Sega","user_time_zone":"Paris","user_profile_background_color":"000000","user_translation_enabled":"false","user_profile_link_color":"1B95E0","user_UTC_offset":"7200","user_profile_sidebar_border_color":"000000","user_has_extended_profile":"true","user_profile_background_tile":"false","user_profile_use_background_image":"false","user_default_profile_image":"false","user_description":"#Sega Videogames RetroDeals available on eBay:  #retrogaming  #Genesis #MegaDrive #SegaCD #MegaCD #GameGear #SegaSaturn #Dreamcast #MasterSystem #Mark3 RT","user_profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","user_profile_sidebar_fill_color":"000000","user_followers_count":"133","user_profile_image_url":"http://pbs.twimg.com/profile_images/777581486266576897/7WB52yjV_normal.jpg","user_geo_enabled":"false","user_entities_description_urls":"[]","user_screen_name":"VGDealsSega","user_total_liked":"43","user_url":"https://t.co/Px0pVPxM1E","user_total_posts":"2162","user_default_profile":"false","user_language":"fr","user_protected":"false","avl_user_brand_association":"{\"virtua\":2}","user_total_public_lists":"13","user_profile_image_url_https":"https://pbs.twimg.com/profile_images/777581486266576897/7WB52yjV_normal.jpg","user_contributors_enabled":"false","user_following":"false","user_verified":"false","avl_user_post_frequency":86.48,"avl_user_like_frequency":1.72,"avl_user_follower_following_ratio":0.9779411764705882,"user_is_translator":"false","user_location":"Planet RetroGaming","user_profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","user_profile_banner_url":"https://pbs.twimg.com/profile_banners/777576939330699265/1474224793","user_profile_text_color":"000000","avl_user_total_retweets":1,"avl_user_total_likes":1,"avl_user_total_replies":null,"avl_user_engagements":2,"user_reply_to_reply_count":null,"avl_user_reply_rate":0}

3. Verification that i1 has 165 records:

It reads 165 records from the input data file. Also, I just added the line "STORE i1 INTO '/tmp/inf' after generating i1 and the output is:

Input(s):
Successfully read 165 records (362590 bytes) from: "/user/alenza/virtua/marketanalysis/data/results/analysis_name/tweets_combined/user"
Output(s):
Successfully stored 165 records (27243 bytes) in: "/tmp/inf"
Successfully stored 0 records (815 bytes) in: "/tmp/inf_med"
Successfully stored 0 records in: "/tmp/inf_top"
Successfully stored 0 records in: "/tmp/inf_bot"

avatar
Guru

@Sree Kupp I have tested this extensively and believe there is a bug in SPLIT OTHERWISE against pig EVAL functions (like avg). I will submit a JIRA and update with the number.

But there is a workaround shown below

-------------------------

Here is what I did to replicate the issue and identify a workaround:

Replicate issue

dataset

777576939330699265,0,3
777576939330699261,1,3
777576939330699262,2,2
777576939330699263,3,1
777576939330699264,4,1

script 1 (replicates your issue)

data_input_2 = LOAD 'data_tweets' USING PigStorage(','); 
i1 = FOREACH data_input_2 GENERATE $0 as user_id, (int)$1 as user_followers_count, (int)$2 as avl_user_total_retweets;
i2 = GROUP i1 ALL;
i3 = FOREACH i2 GENERATE AVG(i1.user_followers_count) AS avg_user_followers_count,  AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets;

SPLIT i1 INTO
   top IF(user_followers_count > i3.avg_user_followers_count),
   bot IF(user_followers_count < i3.avg_user_followers_count),
   med OTHERWISE;
 
STORE i1 INTO 'tmp/inf_1' USING JsonStorage();
STORE i2 INTO 'tmp/inf_2' USING JsonStorage();
STORE i3 INTO 'tmp/inf_3' USING JsonStorage();  
STORE top INTO 'tmp/split_top' USING JsonStorage();
STORE bot INTO 'tmp/split_bot' USING JsonStorage();
STORE med INTO 'tmp/split_med' USING JsonStorage();

Results

tmp/inf_1 and tmp/inf_2: results as expected (5 records, correct grouping)
tmp/inf_3: {"avg_user_followers_count":2.0,"avg_avl_user_total_retweets":2.0}
tmp/split_top 
{"user_id":"777576939330699263","user_followers_count":3,"avl_user_total_retweets":1}
{"user_id":"777576939330699264","user_followers_count":4,"avl_user_total_retweets":1} 
tmp/split_med: no records 
tmp/split_bot 
{"user_id":"777576939330699265","user_followers_count":0,"avl_user_total_retweets":3}
{"user_id":"777576939330699261","user_followers_count":1,"avl_user_total_retweets":3}

Workaround

Here I took the average and hard-coded it into the split. This works (but is an inconvenient hack).

script 2 (hard-coding of average, which works)

Everything the same except I used the following for split, based on dump of i3

SPLIT i1 INTO
   top IF(user_followers_count > 2.0),
   bot IF(user_followers_count < 2.0),
   med OTHERWISE; 

Results

Results were identical to first script, except:

tmp/split_med:
{"user_id":"777576939330699262","user_followers_count":2,"avl_user_total_retweets":2}

Workaround for your script

Find the values in i3 and then hardcode them in the SPLIT statement.

-------------------------

If this answers your question (I hope so 🙂 ), please let me know by accepting the answer; else let me know if there are remaining gaps.

avatar
Contributor

Hi @Greg Keys

SPLIT OTHERWISE is working now in my code. Not sure how and why. I was sure it did not work when I first used it.

Thanks for all your time and effort.

avatar
Guru

Good to hear. Would love to know why it is working (especially since my stripped down version replicated the issue).