Created 09-07-2016 01:34 PM
Hi All,
How can we store the output of Pig into multiple HBase tables? The HBase tables are already created; each specific value needs to be stored into its specific table.
For example, I have got the output as:
(word1){data}
(word2){data}
(word3){data}
(word4){data}
So I need to store the output into the already created tables. The table names are:
word1
word2
word3
word4
Now the output should be stored in the already created tables as:
word1 ----> (word1){data}
word2 ----> (word2){data}
word3 ----> (word3){data}
Any suggestions?
thank you.
Created 09-07-2016 01:44 PM
You would need to assign an alias to each row and specify a separate STORE command per row.
Created 09-07-2016 01:49 PM
Thanks for your reply, Artem Ervits.
Can you please give me an example of that? It would be very helpful.
Created 09-07-2016 04:08 PM
@Mohan V this is not efficient, but it does what you're asking:
grunt> fs -cat text
1 a
2 b
3 c
grunt> data = load 'text' using PigStorage(' ') AS (id:long, letter:chararray);
grunt> A = FILTER data by letter == 'a';
grunt> B = FILTER data by letter == 'b';
grunt> C = FILTER data by letter == 'c';
grunt> STORE A into 'hbase://a' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:letter');
2016-09-07 16:04:29,421 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 698875904 to monitor. collectionUsageThreshold = 489213120, usageThreshold = 489213120
...
grunt> STORE B into 'hbase://b' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:letter');
...
grunt> STORE C into 'hbase://c' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:letter');
Now in the HBase shell, assuming the tables were created with:
create 'a', 'cf'
create 'b', 'cf'
create 'c', 'cf'
hbase(main):001:0> scan 'a'
ROW          COLUMN+CELL
 1           column=cf:letter, timestamp=1473264279802, value=a
1 row(s) in 0.2610 seconds
hbase(main):002:0> scan 'b'
ROW          COLUMN+CELL
 2           column=cf:letter, timestamp=1473264324881, value=b
1 row(s) in 0.0160 seconds
hbase(main):003:0> scan 'c'
ROW          COLUMN+CELL
 3           column=cf:letter, timestamp=1473264429688, value=c
1 row(s) in 0.0140 seconds
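As an aside, the same fan-out can be written with Pig's SPLIT operator instead of repeated FILTER statements. A minimal sketch, assuming the same input relation and the same three pre-created tables:

data = LOAD 'text' USING PigStorage(' ') AS (id:long, letter:chararray);

-- route each row to a relation based on its letter value
SPLIT data INTO A IF letter == 'a',
                B IF letter == 'b',
                C IF letter == 'c';

-- a separate STORE is still required for each HBase table
STORE A INTO 'hbase://a' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:letter');
STORE B INTO 'hbase://b' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:letter');
STORE C INTO 'hbase://c' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:letter');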
Created 09-08-2016 02:37 PM
@Artem Ervits thanks for your valuable explanation.
Using that, I have tried it another way.
That is, without storing the output to a text file and loading it back with PigStorage, I tried to filter on the word directly and store the result into HBase.
Above I described only the scenario I need; here is the actual script and data I used.
Script & output:
A = foreach (group epoch BY epochtime) {
    data = foreach epoch generate created_at, id, user_id, text;
    generate group as pattern, data;
}

By using this I got the output below:

(word1_1473344765_265217609700,{(Wed Apr 20 07:23:20 +0000 2016,252479809098223616,450990391,rt @joey7barton: ..give a word1 about whether the americans wins a ryder cup. i mean surely he has slightly more important matters. #fami ...),(Wed Apr 22 07:23:20 +0000 2016,252455630361747457,118179886,@dawnriseth word1 and then we will have to prove it again by reelecting obama in 2016, 2020... this race-baiting never ends.)})
(word2_1473344765_265217609700,{(Wed Apr 21 07:23:20 +0000 2016,252370526411051008,845912316,@maarionymcmb word2 mere ta dit tu va resté chez toi dnc tu restes !),(Wed Apr 23 07:23:20 +0000 2016,252213169567711232,14596856,rt @chernynkaya: "have you noticed lately that word2 is getting credit for the president being in the lead except pres. obama?" ...)})

Now, without dumping it or storing it into a file, I tried this:

B = FILTER A BY pattern == 'word1_1473325383_265214120940';
describe B;
B: {pattern: chararray,data: {(json::created_at: chararray,json::id: chararray,json::user_id: chararray,json::text: chararray)}}
STORE B into 'hbase://word1_1473325383_265214120940' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:data');
The job reported success, but no data was stored into the table. When I checked the logs, I found the warning below.
2016-09-08 19:45:46,223 [Readahead Thread #2] WARN org.apache.hadoop.io.ReadaheadPool - Failed readahead on ifile EBADF: Bad file descriptor
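For reference, since the data field is a bag, one variant might be to flatten it into scalar columns before the STORE. This is only a sketch based on the schema from describe B above; note that every flattened tuple would share the same row key (pattern), so later rows would overwrite earlier ones in HBase:

-- flatten the bag so each tuple becomes a row of scalar fields
C = FOREACH B GENERATE pattern, FLATTEN(data);

-- the first field is used as the row key; the remaining fields map to the listed columns
STORE C into 'hbase://word1_1473325383_265214120940'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:created_at cf:id cf:user_id cf:text');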
Please let me know what I am missing here.
thank you.