Member since
06-03-2016
66
Posts
21
Kudos Received
7
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3297 | 12-03-2016 08:51 AM
 | 1776 | 09-15-2016 06:39 AM
 | 1973 | 09-12-2016 01:20 PM
 | 2280 | 09-11-2016 07:04 AM
 | 1889 | 09-09-2016 12:19 PM
09-08-2016
02:37 PM
@Artem Ervits thanks for your valuable explanation. Using that, I tried it another way: instead of storing the output to a text file and loading it back with PigStorage, I filtered on the word first and tried to store the result directly into HBase. Above I mentioned only the scenario I need; here are the actual script and data I used.

Script:

A = foreach (group epoch BY epochtime) {
    data = foreach epoch generate created_at, id, user_id, text;
    generate group as pattern, data;
}

With this I got the output below:
(word1_1473344765_265217609700,{(Wed Apr 20 07:23:20 +0000 2016,252479809098223616,450990391,rt @joey7barton: ..give a word1 about whether the americans wins a ryder cup. i mean surely he has slightly more important matters. #fami ...),(Wed Apr 22 07:23:20 +0000 2016,252455630361747457,118179886,@dawnriseth word1 and then we will have to prove it again by reelecting obama in 2016, 2020... this race-baiting never ends.)})
(word2_1473344765_265217609700,{(Wed Apr 21 07:23:20 +0000 2016,252370526411051008,845912316,@maarionymcmb word2 mere ta dit tu va resté chez toi dnc tu restes !),(Wed Apr 23 07:23:20 +0000 2016,252213169567711232,14596856,rt @chernynkaya: "have you noticed lately that word2 is getting credit for the president being in the lead except pres. obama?" ...)})
Now, without dumping it or storing it into a file, I tried this:
B = FILTER A BY pattern == 'word1_1473325383_265214120940';
describe B;
B: {pattern: chararray,data: {(json::created_at: chararray,json::id: chararray,json::user_id: chararray,json::text: chararray)}}
STORE B into 'hbase://word1_1473325383_265214120940' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data');
The store reported success, but no data was written to the table. When I checked the logs, I found this warning:

2016-09-08 19:45:46,223 [Readahead Thread #2] WARN org.apache.hadoop.io.ReadaheadPool - Failed readahead on ifile
EBADF: Bad file descriptor

Please suggest what I am missing here. Thank you.
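One thing that may be worth checking (a sketch, not a verified fix): HBaseStorage treats the first field of each tuple as the HBase row key and maps the remaining fields onto the listed columns. In B the second field is a bag, and every tuple inside it shares the same pattern key. Flattening the bag and using a per-record row key (the tweet id here, as an assumption) may behave more predictably:

```pig
-- hypothetical reshaping before the store; field names follow the describe output above
flat  = FOREACH B GENERATE FLATTEN(data);
-- first field becomes the row key; remaining fields map to the cf: columns in order
keyed = FOREACH flat GENERATE json::id, json::created_at, json::user_id, json::text;
STORE keyed INTO 'hbase://word1_1473325383_265214120940'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:created_at cf:user_id cf:text');
```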
09-08-2016
12:01 PM
3 Kudos
Hi all, we are trying to migrate our existing RDBMS (SQL database) system to Hadoop, and we plan to use HBase for it. However, we are not sure how to denormalize the SQL data to store it in HBase's column-oriented format. Is it possible? If yes, what would be the best approach, and which HBase version is required? Any suggestions?
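In case it helps the discussion, here is a minimal sketch of one common pattern (all file, table, and column names below are made up for illustration): export the relational tables to HDFS (e.g. with Sqoop), join them in Pig, and store one denormalized wide row per entity into HBase via HBaseStorage:

```pig
-- hypothetical CSV exports of two RDBMS tables
customers = LOAD 'customers.csv' USING PigStorage(',')
            AS (cust_id:chararray, name:chararray, city:chararray);
orders    = LOAD 'orders.csv' USING PigStorage(',')
            AS (order_id:chararray, cust_id:chararray, amount:chararray);
joined    = JOIN orders BY cust_id, customers BY cust_id;
-- denormalize: one wide row per order, keyed by order_id (first field = row key)
wide = FOREACH joined GENERATE orders::order_id, orders::amount,
                               customers::name, customers::city;
STORE wide INTO 'hbase://orders_wide'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:amount cf:name cf:city');
```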
Labels:
- Apache Hadoop
- Apache HBase
09-07-2016
01:49 PM
Thanks for your reply, Artem Ervits. Can you please give me an example of that? It would be very helpful for me.
09-07-2016
01:34 PM
1 Kudo
Hi all, how can we store the output of Pig into multiple HBase tables? The HBase tables are already created; each specific value needs to go into its specific table. For example, I got the output as:

(word1){data}
(word2){data}
(word3){data}
(word4){data}

The already created tables are named word1, word2, word3, and word4. Now the output should be stored into them as:

word1 ----> (word1){data}
word2 ----> (word2){data}
word3 ----> (word3){data}

Any suggestions? Thank you.
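A sketch of one way to do this with SPLIT (assuming the relation is named A with schema (word:chararray, data), and that the four tables already exist):

```pig
-- route each tuple to its own relation by the word field
SPLIT A INTO w1 IF word == 'word1',
             w2 IF word == 'word2',
             w3 IF word == 'word3',
             w4 IF word == 'word4';
-- one STORE per target table; the first field is used as the row key
STORE w1 INTO 'hbase://word1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data');
STORE w2 INTO 'hbase://word2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data');
STORE w3 INTO 'hbase://word3' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data');
STORE w4 INTO 'hbase://word4' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data');
```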
Labels:
- Apache Hadoop
- Apache HBase
- Apache Pig
09-07-2016
12:09 PM
1 Kudo
Trying to load a JSON file that has null values in it, using the elephant-bird JsonLoader.

sample.json:

{"created_at":"Mon Aug 22 10:48:23 +0000 2016","id":767674772662607873,"id_str":"767674772662607873","text":"KPIT Image Result for https:\/\/t.co\/Nas2ZnF1zZ... https:\/\/t.co\/9TnelwtIvm","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":123,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/Nas2ZnF1zZ","expanded_url":"http:\/\/miltonious.com\/","display_url":"miltonious.com","indices":[24,47]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1471862903167"}

script:

REGISTER piggybank.jar
REGISTER json-simple-1.1.1.jar
REGISTER elephant-bird-pig-4.3.jar
REGISTER elephant-bird-core-4.1.jar
REGISTER elephant-bird-hadoop-compat-4.3.jar
json = LOAD 'sample.json' USING JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');
describe json;
dump json;

When I dump json, I get the following output and warning:

(Mon Aug 22 10:48:23 +0000 2016,767674772662607873,767674772662607873,google Image Result for Twitter Web Client,false,1234,12345,3214,43215,,,,,,,,,,,,,,)

WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad record, returning null for {complete json}

From the warning, I guess the loader is returning null for records that contain null values. So how can we load JSON that has null values in it? I also tried another way:

json = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');
describe json;

This gives: Schema for json unknown.

Please suggest. Thanks.
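For what it's worth, a sketch of the map-based way of using elephant-bird's loader, which tolerates nulls because a missing or null key simply comes back as null from the map lookup (field names follow the sample record above):

```pig
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER elephant-bird-core-4.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
-- no schema string: each record is loaded as a single map
json = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
fields = FOREACH json GENERATE (chararray)$0#'created_at' AS created_at,
                               (chararray)$0#'id_str'     AS id_str,
                               (chararray)$0#'text'       AS text;
dump fields;
```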
Labels:
- Apache Hadoop
- Apache Pig
09-07-2016
07:18 AM
I think I got it on my own. Here is the script I wrote:

res = FILTER c BY (data::text MATCHES CONCAT(CONCAT('.*', words::word), '.*'));
epoch = FOREACH res GENERATE CONCAT(CONCAT(word, '_'), (chararray)ToUnixTime(CurrentTime())) AS epochtime, created_at, id, user_id, text;
res1 = FOREACH (GROUP epoch BY epochtime) GENERATE group, epoch;
dump res1;
09-06-2016
07:09 PM
Hi all, sorry for the wrong phrasing of my question. I have a scenario where I need to process a words.txt file and a data.txt file.

words.txt:

word1
word2
word3
word4

data.txt:

{"created_at":"18:47:31,Sun Sep 30 2012","text":"RT @Joey7Barton: ..give a word1 about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616}

I need to get the output as (word1_epochtime){complete data that matched on the text attribute}, i.e.
(word1_1234567890){"created_at":"18:47:31,Sun Sep 30 2012","text":"RT @Joey7Barton: ..give a word1 about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616}

I have got the output as (word1){"created_at":"18:47:31,Sun Sep 30 2012","text":"RT @Joey7Barton: ..give a word1 about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...","user_id":450990391,"id":252479809098223616} by using this script:

-- 1. load words.txt
-- 2. load data.txt
c = cross words, data;
d = FILTER c BY (data::text MATCHES CONCAT(CONCAT('.*', words::word), '.*'));
e = foreach (group d BY word) generate group, d;

And I got the epoch time with the words as:

time = FOREACH words GENERATE CONCAT(CONCAT(word, '_'), (chararray)ToUnixTime(CurrentTime()));

But I am unable to CONCAT the words with the time. How can I get the output as (word1_epochtime){data}?
Please feel free to suggest. Mohan.V
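A sketch of one way around this (assuming the aliases above): compute the epoch-time key inside the FOREACH that follows the FILTER, so the word and the timestamp are concatenated in the same projection instead of living in two different relations:

```pig
d = FILTER c BY (data::text MATCHES CONCAT(CONCAT('.*', words::word), '.*'));
-- build word_epochtime per matched row; CurrentTime() takes no arguments
keyed = FOREACH d GENERATE
            CONCAT(CONCAT(words::word, '_'), (chararray)ToUnixTime(CurrentTime())) AS wordtime,
            data::created_at, data::text, data::user_id, data::id;
out = FOREACH (GROUP keyed BY wordtime) GENERATE group, keyed;
```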
Labels:
- Apache Hadoop
- Apache Pig
09-06-2016
01:07 PM
I think I found the answer on my own:

B = FOREACH words GENERATE CONCAT(CONCAT(word,'_'),(chararray)ToUnixTime(CurrentTime()));

I just removed the A. prefix from the inner CONCAT, and it worked fine.
09-06-2016
06:40 AM
Hi all, I am new to Pig and trying to learn on my own. I have written a script to get the epoch time with each word read from a words.txt file. Here is the script:

words = LOAD 'words.txt' AS (word:chararray);
B = FOREACH words GENERATE CONCAT(CONCAT(A.word,'_'),(chararray)ToUnixTime(CurrentTime()));
dump B;

The issue is: if words.txt has only one word, it gives the proper output, but if it has multiple words, like

word1
word2
word3
word4
then it gives the following error:

ERROR 1066: Unable to open iterator for alias B
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (word1 ), 2nd :(word2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (word1 ), 2nd :(word2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:122)
at o

Please suggest how to solve this issue. Thank you. Mohan.V
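For reference, a minimal sketch of a version that avoids the scalar dereference (the A.word in the inner CONCAT is what makes Pig read the relation as a scalar and raise "Scalar has more than one row"; projecting the field directly keeps it per-row):

```pig
words = LOAD 'words.txt' AS (word:chararray);
-- refer to the field 'word' directly, not via a relation prefix
B = FOREACH words GENERATE CONCAT(CONCAT(word, '_'), (chararray)ToUnixTime(CurrentTime()));
dump B;
```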
Labels:
- Apache Hadoop
- Apache Pig
09-02-2016
02:21 PM
I think I got it too. Please correct me if I'm wrong:

A = LOAD 'words.txt' AS (word:chararray);
B = FOREACH A GENERATE CONCAT(CONCAT(A.word,'_'),(chararray)ToUnixTime(CurrentTime()));
dump B;