
How to efficiently map data columns from external files (around 5000) to a hbase column set?

SRoy — Thu, 10 Mar 2016 04:03:36 GMT

I have data coming in as small files from a set of 5000 data providers. Since the data comes from external providers, the files have no id field, so I need to read each line and search HBase to find the id. At the moment I am ignoring the complexity of creating a new id if none is found by the search. Since the incoming data has different numbers of columns and different formats, I have to create a common format for storing all the data in an HBase table.

My question: is there a tool or trick that can help me efficiently map the fields from these 5000 data formats to a common format? Also, how should I manage the case where a data provider modifies its data format?

If anybody has implemented such a system or has recommendations, I would be glad to hear them.

Re: How to efficiently map data columns from external files (around 5000) to a hbase column set?

rgelhausen — Thu, 10 Mar 2016 04:54:08 GMT

Hi Roy, please have a look at Apache Phoenix and its views feature. It lets you define a base set of columns (producer_id, timestamp, event_type, etc.) and, within the same table, create additional logical views per record type.

Your use case sounds similar to the product_metrics table and the specific mobile_product_metrics example given in the Phoenix views documentation. Once your views are defined, you can query them to get metadata to apply to the records in your ingest queue.

Phoenix Views support issuing upsert statements to write new data.

Re: changing schemas: Phoenix views can be altered at will as your schemas change.
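To make the per-provider view idea concrete, here is a minimal Python sketch that only builds the Phoenix SQL strings; it does not connect to anything. The base table name `events`, the view name, the provider id `0042`, and the extra column names are all hypothetical, and it assumes `producer_id` is a VARCHAR column of the base table. The statement shapes follow the CREATE VIEW and ALTER VIEW syntax described in the Phoenix documentation.

```python
import re

# Hypothetical guard: only allow simple identifiers, since the names are
# interpolated directly into SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def create_view_sql(base_table, view_name, provider_id, extra_columns):
    """Build a CREATE VIEW statement for one provider.

    extra_columns maps column name -> Phoenix SQL type for the fields
    that only this provider sends (assumed to come from your own config).
    """
    cols = ", ".join(
        f"{_check(name)} {_check(sql_type)}"
        for name, sql_type in extra_columns.items()
    )
    col_clause = f" ({cols})" if cols else ""
    # provider_id is assumed to be a simple token, so it is checked the same way.
    return (
        f"CREATE VIEW {_check(view_name)}{col_clause} AS "
        f"SELECT * FROM {_check(base_table)} "
        f"WHERE producer_id = '{_check(provider_id)}'"
    )


def add_column_sql(view_name, column, sql_type):
    """Build an ALTER VIEW statement for when a provider adds a field."""
    return f"ALTER VIEW {_check(view_name)} ADD {_check(column)} {_check(sql_type)}"


print(create_view_sql("events", "provider_0042_events", "p0042",
                      {"carrier": "VARCHAR", "dropped_calls": "BIGINT"}))
print(add_column_sql("provider_0042_events", "signal_strength", "INTEGER"))
```

With 5000 providers you would presumably generate these statements from a configuration file rather than write them by hand, and run them through whatever Phoenix client you already use.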

Re: How to efficiently map data columns from external files (around 5000) to a hbase column set?

Enis — Thu, 10 Mar 2016 05:29:05 GMT

In HBase, you do not have to pre-declare the set of columns as you would in an RDBMS. Each row can have a different set of columns, which is one of the powerful features of HBase.

Phoenix exposes this through a feature called "dynamic columns". You declare a fixed set of columns in the Phoenix table schema, and at query time or insertion time you can specify additional columns, with their types, on the fly.

Check out https://phoenix.apache.org/dynamic_columns.html for syntax.
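As a rough illustration of the insertion side, the Python sketch below builds an UPSERT in which the dynamic columns carry their type inline, as in the syntax on that page. It only produces the SQL text and the parameter list; the table name `events` and all column names are hypothetical, and the `?` placeholders assume a JDBC-style client.

```python
def dynamic_upsert(table, declared, dynamic):
    """Return (sql, params) for a Phoenix UPSERT with dynamic columns.

    declared: {column: value} for columns that exist in the table schema.
    dynamic:  {column: (sql_type, value)} for columns declared on the fly;
              each one is written as "name TYPE" in the column list.
    Names are assumed to come from trusted configuration, not user input.
    """
    names = list(declared) + [f"{col} {sql_type}"
                              for col, (sql_type, _) in dynamic.items()]
    params = list(declared.values()) + [value for _, value in dynamic.values()]
    placeholders = ", ".join("?" for _ in params)
    sql = f"UPSERT INTO {table} ({', '.join(names)}) VALUES ({placeholders})"
    return sql, params


sql, params = dynamic_upsert(
    "events",
    {"producer_id": "p0042", "event_id": 1},          # schema columns
    {"carrier": ("VARCHAR", "acme"),                  # provider-specific
     "dropped_calls": ("BIGINT", 3)},
)
print(sql)
print(params)
```

A provider that changes its format then only changes the `dynamic` mapping for its records; the table schema itself stays the same.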