About emilysharpe

emilysharpe · ‎07-05-2016

We currently have a Pig script that loads image (blob) data using AvroStorage with a predefined Avro schema, then stores it into HBase with HBaseStorage, specifying which columns to use. Each row from the original DB consists of a row ID and 5 image columns, although any number of the image columns could be empty/NULL, e.g.

KEY  IMG1  IMG2  IMG3
1    null  blob  blob
2    blob  null  blob

In the HBase table, the column family for the images is i, with column names i1, i2, etc. It was my understanding that any cells containing a NULL value would automatically not be stored in HBase. However, those cells are being stored as a key-value pair with a full key and an empty value '', rather than not existing.

Is this the normal/expected behaviour? If not, what is the best way to get around it? Do I have to project the key with each of the five columns, filter out nulls and store them individually (i.e. store 5 times)? Or could it be related to a difference in the way Avro, Pig and HBase all represent null values? Is there a simple type conversion I could do on all the image columns so that they would automatically not be stored if they are empty?

Versions: we are running plain HDP 2.1, soon to be upgraded to 2.4.
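For illustration, the "project, filter and store per column" workaround mentioned in the question might look roughly like the sketch below. The input path, the field names (id, img1) and the HBase table name images are assumptions made for the example, not taken from the actual script, and the built-in AvroStorage is assumed to be available in the Pig version in use.

```pig
-- Sketch only: path, field names and table name are hypothetical.
imgs = LOAD '/data/images.avro' USING AvroStorage();

-- Project the row key with one image column, then drop rows where it is null.
i1     = FOREACH imgs GENERATE id, img1;
i1_set = FILTER i1 BY img1 IS NOT NULL;

-- The first field is used as the row key; img1 goes to column i:i1.
STORE i1_set INTO 'hbase://images'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('i:i1');

-- The same three statements would be repeated for img2..img5 (i:i2..i:i5).
```

If the empty cells turn out to come from zero-length values rather than true nulls, the FILTER condition would need to test for that instead.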

emilysharpe · ‎06-30-2016

I haven't tested it, but I believe using -tagFile will prepend the file name, which will place it at position 0 instead of 1. I.e. GENERATE (chararray)$0 AS Filename, (chararray)$1 AS ID, etc. Hope this solves it!
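As a minimal, untested sketch of that suggestion: the input path, the delimiter and the second field name are assumptions for illustration, and -tagFile requires a Pig version whose PigStorage supports that option.

```pig
-- Sketch only: path and delimiter are hypothetical.
-- With -tagFile, PigStorage prepends the source file name as field $0.
raw = LOAD '/data/input' USING PigStorage(',', '-tagFile');

-- The original first column is therefore at $1, not $0.
named = FOREACH raw GENERATE (chararray)$0 AS Filename, (chararray)$1 AS ID;
```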

emilysharpe · ‎06-28-2016

Hi @João Souza, no problem. Yes you should still be able to use split, just with IF (date=='2016-06-23') comparing string type instead of date type. Hope this helps!
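A small sketch of the string-based split, assuming Src is loaded with a chararray field called date holding values such as 2016-06-23 (the input path, delimiter and schema here are made up for the example):

```pig
-- Sketch only: path, delimiter and schema are hypothetical.
Src = LOAD '/data/Src/input' USING PigStorage(' ')
      AS (id:chararray, date:chararray);

-- Compare the date as a plain string; no ToDate() conversion is needed.
SPLIT Src INTO Src23 IF (date == '2016-06-23'),
               Src24 IF (date == '2016-06-24');

STORE Src23 INTO '/data/Src/2016-06-23' USING PigStorage(' ');
STORE Src24 INTO '/data/Src/2016-06-24' USING PigStorage(' ');
```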

emilysharpe · ‎06-27-2016

There's currently no mechanism to force the name of MapReduce output files. Once you've loaded all the data and added the extra column, you can split your alias into one per date, then store each one in a different directory, e.g.

SPLIT Src INTO Src23 IF date==ToDate('2016-06-23', 'yyyy-MM-dd'), Src24 IF date==ToDate('2016-06-24', 'yyyy-MM-dd'), Src25 IF date==ToDate('2016-06-25', 'yyyy-MM-dd');
STORE Src23 INTO '/data/Src/2016-06-23' using PigStorage(' ');

This way, you could merge the output files in each date directory using -getmerge (and specify the resulting file name), and then copy them back onto HDFS.

Another option is to force a reduce job to occur (yours is map only) and set PARALLEL 1. It will be a slower job, but you will get one output file, e.g.

Ordered23 = ORDER Src23 BY somecolumn PARALLEL 1;
STORE Ordered23 INTO '/data/Src/2016-06-23' using PigStorage(' ');

You would still have to rename the files outside of this process.
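The merge-and-copy-back step could be sketched with Hadoop fs commands, which can also be issued from the Grunt shell or a Pig script via fs. The local temporary path and the final file name below are hypothetical, and the commands assume the STORE into the date directory has already finished.

```pig
-- Sketch only: local path and target file name are hypothetical.
-- Merge all part files of one date directory into a single local file.
fs -getmerge /data/Src/2016-06-23 /tmp/Src_2016-06-23.txt

-- Copy the merged file back onto HDFS under the chosen name.
fs -put /tmp/Src_2016-06-23.txt /data/Src/Src_2016-06-23.txt
```

The same two commands would be repeated for each date directory.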

emilysharpe · ‎04-28-2016

Nice solution @Predrag Minovic. Simple and neat, thanks! +1

emilysharpe · ‎03-18-2016

Thanks @Rushikesh Deshmukh for your response. Using these, how would you recommend 'correcting' an existing store of data - compactions reduce the number of files per region, but how would you reduce the number of existing regions? Is this possible with the current status of merge tools?

emilysharpe · ‎03-15-2016

http://hbase.apache.org/0.94/book/important_configurations.html suggests manually managing HBase region splits. Do others in the community do this? If so: Do you have a use case or example regarding the steps required (including setting hbase.hregion.max.filesize) and how/what you use to implement them? and Have you found it worthwhile in terms of effort vs benefits? Thanks

emilysharpe · ‎02-26-2016

Excellent reference, thanks!

emilysharpe · ‎12-16-2015

Hi @Chris Nauroth thanks for the confirmation, and great to know the option has been suggested 🙂

emilysharpe · ‎12-16-2015

Hi @Neeraj Sabharwal, thank you for the script line - looks like I will be adding that in!

Member Since	‎12-09-2015 11:26 PM
Last Visited	‎11-16-2018 12:44 AM
Posts	35
Kudos received	13

Cloudera Community

Re: Merge and Rename files in HDFS - Pig?

Best way to ensure null values are not stored in H...

Re: Merge and Rename files in HDFS - Pig?

Re: Merge and Rename files in HDFS - Pig?

Re: Merge and Rename files in HDFS - Pig?

Re: I want to import certain tables from multiple ...

Re: How to manually manage number of HBase regions...

How to manually manage number of HBase regions?

Re: HDFS Permission Checks

Re: Is there an -ignoreCrc equivalent when using g...

Re: Is there an -ignoreCrc equivalent when using g...