Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Best way to ensure null values are not stored in HBase

avatar
Rising Star

We currently have a Pig script that just loads in image (blob) data using AvroStorage with a predefined Avro schema, then stores into HBase with HBaseStorage specifying which columns to use.

Each row from the original DB consists of a row ID and 5 image columns, although any number of the image columns could be empty/NULL. e.g.

KEYIMG1IMG2IMG3
1nullblobblob
2blobnullblob

In the HBase table, the column family for the images is i, with column names i1, i2, etc.

It was my understanding that any cells containing a NULL value would automatically not be stored in HBase, however those cells are being stored as a key-value pair with a full key and an empty value '', rather than not existing.

Is this the normal/expected behaviour?

If not, what is the best way to get around it? Do I have to project the key with each of the five columns, filter out nulls and store them individually (i.e. store 5 times)?

Or could it be related to a difference in the way Avro, Pig and HBase all represent null values? Is there a simple type conversion I could do on all the image columns so that they would automatically not be stored if they are empty?

Versions: we are running plain HDP 2.1, soon to be upgraded to 2.4

1 ACCEPTED SOLUTION

avatar
Contributor

@Emily Sharpe I believe your issue relates to the way Pig is processing the NULL Avro data. Rather than ignoring those NULL values, Pig passes the key and an empty value to HBase, which dutifully stores it.

To avoid storing these values, filter them out. The following Pig code shows how to do this for a single key/value Avro source:

ImageAvro = LOAD '/path/to/RawAvroData' USING org.apache.pig.piggybank.storage.avro.AvroStorage ('no_schema_check', 'schema_file', '/path/to/AvroSchemaFile.avsc');
filteredImage = FOREACH (FILTER ImageAvro BY SIZE(ImageColumn) > 0) GENERATE KeyColumn, ImageColumn;
STORE filteredImage INTO 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col');

Similarly, you can identify the empty cells with the same FILTER operation. Here's how to save a list of keys that have an empty colFamily:col cell:

ImageHBase = LOAD 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col', '-loadKey true') as (KeyColumn:chararray, ImageColumn:bytearray);
NullImage = FOREACH (FILTER ImageHBase BY SIZE(ImageColumn) == 0) GENERATE KeyColumn;
STORE NullImage into '/path/to/flat/file' USING PigStorage;

View solution in original post

7 REPLIES 7

avatar
Master Guru

@Emily Sharpe Its all based on how you query the table. Nulls are ok in hbase as they do not take up any space in storage. You are getting the empty/null value back for the field since that column is populated by other rows. it is just stating the fact that this specific map key/value is not populated when traversing through all the values in the column.

Did I understand your ? correctly?

avatar
Master Guru

@Emily Sharpe If there are "special" characters in the field which are not visible you can run a co-processor to remove any special characters prior to upsert.

avatar
Rising Star

@Sunile Manjee thank you, yes I believe so.

I have read both that every cell stores a full row key, and that empty cells do not exist at all in HBase. So I think I expect the behaviour of a get command for a specified cell to bring back either a key and a non-empty value, or nothing (no key) as it "does not exist".

If I request a cell that I know does not exist e.g. a real key, real column family, fake column qualifier, I get no errors but 0 rows returned. However for a cell that "doesn't exist" based on a null value, I get one row returned (the key and empty value). How does the get command know to return the key if it isn't stored as a part of that cell? Does this come from metadata rather than the cell itself?

In any case, I hope this explains my confusion.

avatar
Master Guru

@Emily Sharpe you can create columns as place holders. I have done this in a few projects. If you want to filter out you can use api setFilterIfIMissing and singlecolumnvaluefilter.

avatar
Contributor

@Emily Sharpe I believe your issue relates to the way Pig is processing the NULL Avro data. Rather than ignoring those NULL values, Pig passes the key and an empty value to HBase, which dutifully stores it.

To avoid storing these values, filter them out. The following Pig code shows how to do this for a single key/value Avro source:

ImageAvro = LOAD '/path/to/RawAvroData' USING org.apache.pig.piggybank.storage.avro.AvroStorage ('no_schema_check', 'schema_file', '/path/to/AvroSchemaFile.avsc');
filteredImage = FOREACH (FILTER ImageAvro BY SIZE(ImageColumn) > 0) GENERATE KeyColumn, ImageColumn;
STORE filteredImage INTO 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col');

Similarly, you can identify the empty cells with the same FILTER operation. Here's how to save a list of keys that have an empty colFamily:col cell:

ImageHBase = LOAD 'hbase://namespace:table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colFamily:col', '-loadKey true') as (KeyColumn:chararray, ImageColumn:bytearray);
NullImage = FOREACH (FILTER ImageHBase BY SIZE(ImageColumn) == 0) GENERATE KeyColumn;
STORE NullImage into '/path/to/flat/file' USING PigStorage;

avatar
Rising Star

Hi @Sunile Manjee

@Steven O'Neill's solution above fixed the issue. Originally, trying to retrieve information for a cell which should be null looked like:

  • Using REST API and going to the HBase URL: Internet explorer would download a 0KB file
  • HBase Shell get command: 1 row returned, "timestamp=14856365476857, value="

After doing a pig filter and store for each individual column, trying to retrieve the same cell looks like:

  • REST: http error 404
  • HBase Shell get: 0 rows returned

So the empty cells were not being automatically dropped out plus HBase was storing the key for the cells with no value.

avatar
Master Guru

@Emily Sharpe thanks for the insights.