Created 07-05-2016 04:36 AM
We currently have a Pig script that just loads in image (blob) data using AvroStorage with a predefined Avro schema, then stores into HBase with HBaseStorage specifying which columns to use.
Each row from the original DB consists of a row ID and 5 image columns, although any number of the image columns could be empty/NULL. e.g.
| KEY | IMG1 | IMG2 | IMG3 | 
| 1 | null | blob | blob | 
| 2 | blob | null | blob | 
In the HBase table, the column family for the images is i, with column names i1, i2, etc.
It was my understanding that any cells containing a NULL value would automatically not be stored in HBase, however those cells are being stored as a key-value pair with a full key and an empty value '', rather than not existing.
Is this the normal/expected behaviour?
If not, what is the best way to get around it? Do I have to project the key with each of the five columns, filter out nulls and store them individually (i.e. store 5 times)?
Or could it be related to a difference in the way Avro, Pig and HBase all represent null values? Is there a simple type conversion I could do on all the image columns so that they would automatically not be stored if they are empty?
Versions: we are running plain HDP 2.1, soon to be upgraded to 2.4
Created 07-13-2016 04:41 AM
@Emily Sharpe I believe your issue relates to the way Pig is processing the NULL Avro data. Rather than ignoring those NULL values, Pig passes the key and an empty value to HBase, which dutifully stores it.
To avoid storing these values, filter them out. The following Pig code shows how to do this for a single key/value Avro source:
ImageAvro = LOAD '/path/to/RawAvroData' USING org.apache.pig.piggybank.storage.avro.AvroStorage ('no_schema_check', 'schema_file', '/path/to/AvroSchemaFile.avsc');
filteredImage = FOREACH (FILTER ImageAvro BY SIZE(ImageColumn) > 0) GENERATE KeyColumn, ImageColumn;
STORE filteredImage INTO 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col');Similarly, you can identify the empty cells with the same FILTER operation. Here's how to save a list of keys that have an empty colFamily:col cell:
ImageHBase = LOAD 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col', '-loadKey true') as (KeyColumn:chararray, ImageColumn:bytearray);
NullImage = FOREACH (FILTER ImageHBase BY SIZE(ImageColumn) == 0) GENERATE KeyColumn;
STORE NullImage into '/path/to/flat/file' USING PigStorage;
Created 07-05-2016 04:40 AM
@Emily Sharpe Its all based on how you query the table. Nulls are ok in hbase as they do not take up any space in storage. You are getting the empty/null value back for the field since that column is populated by other rows. it is just stating the fact that this specific map key/value is not populated when traversing through all the values in the column.
Did I understand your ? correctly?
Created 07-05-2016 04:42 AM
@Emily Sharpe If there are "special" characters in the field which are not visible you can run a co-processor to remove any special characters prior to upsert.
Created 07-05-2016 05:36 AM
@Sunile Manjee thank you, yes I believe so.
I have read both that every cell stores a full row key, and that empty cells do not exist at all in HBase. So I think I expect the behaviour of a get command for a specified cell to bring back either a key and a non-empty value, or nothing (no key) as it "does not exist".
If I request a cell that I know does not exist e.g. a real key, real column family, fake column qualifier, I get no errors but 0 rows returned. However for a cell that "doesn't exist" based on a null value, I get one row returned (the key and empty value). How does the get command know to return the key if it isn't stored as a part of that cell? Does this come from metadata rather than the cell itself?
In any case, I hope this explains my confusion.
Created 07-06-2016 12:41 AM
@Emily Sharpe you can create columns as place holders. I have done this in a few projects. If you want to filter out you can use api setFilterIfIMissing and singlecolumnvaluefilter.
Created 07-13-2016 04:41 AM
@Emily Sharpe I believe your issue relates to the way Pig is processing the NULL Avro data. Rather than ignoring those NULL values, Pig passes the key and an empty value to HBase, which dutifully stores it.
To avoid storing these values, filter them out. The following Pig code shows how to do this for a single key/value Avro source:
ImageAvro = LOAD '/path/to/RawAvroData' USING org.apache.pig.piggybank.storage.avro.AvroStorage ('no_schema_check', 'schema_file', '/path/to/AvroSchemaFile.avsc');
filteredImage = FOREACH (FILTER ImageAvro BY SIZE(ImageColumn) > 0) GENERATE KeyColumn, ImageColumn;
STORE filteredImage INTO 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col');Similarly, you can identify the empty cells with the same FILTER operation. Here's how to save a list of keys that have an empty colFamily:col cell:
ImageHBase = LOAD 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col', '-loadKey true') as (KeyColumn:chararray, ImageColumn:bytearray);
NullImage = FOREACH (FILTER ImageHBase BY SIZE(ImageColumn) == 0) GENERATE KeyColumn;
STORE NullImage into '/path/to/flat/file' USING PigStorage;
Created 07-14-2016 01:34 AM
@Steven O'Neill's solution above fixed the issue. Originally, trying to retrieve information for a cell which should be null looked like:
After doing a pig filter and store for each individual column, trying to retrieve the same cell looks like:
So the empty cells were not being automatically dropped out plus HBase was storing the key for the cells with no value.
Created 07-14-2016 04:11 AM
@Emily Sharpe thanks for the insights.
 
					
				
				
			
		
