Support Questions

Find answers, ask questions, and share your expertise

Phoenix table design

avatar
Rising Star

I am designing a table in hbase using phoenix.

Can I define a column family at the beginning and then add column qualifiers at run time? Also, when I query the table how would I know what are the column qualifiers I have in that table? Do I need to keep a manual metadata table to keep a track of the column qualifiers? It's difficult in that case as for every record the qualifiers might differ.

1 ACCEPTED SOLUTION

avatar

Hi, @rudra prasad biswas,

Column families are used to optimize RegionServer memory by separating frequently and infrequently used columns. For example, when using HBase for document storage, document metadata is queried 5-10x for every one read of document content. In this type of application, you would store document content in one column family, and document metadata in another, allowing HBase to avoid loading document content into memory until it is explicitly accessed.

As @Ryan Cicak hinted at, Phoenix tables assign columns to the first column family by default. However, you can specify which family to assign columns to:

CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR)

In this example, COL3 is stored in column family B, and other columns belong to family A.

Assuming there is a subset of column qualifiers you know you will use, they should be defined in the table definition (DDL). However, this is not required. See the Phoenix docs on dynamic columns and views to see how to use Phoenix to support writing arbitrary column qualifiers to a table without pre-defining those columns. Views let you keep track of and expose dynamic columns when needed.

View solution in original post

4 REPLIES 4

avatar

Hi @rudra prasad biswas,

Great questions - don't think about column families / column qualifiers, because Phoenix will interact with HBase to automatically do all this for you. Instead, simply create your table:

CREATE TABLE mytable (id integer, first_name varchar, last_name varchar CONSTRAINT my_pk PRIMARY KEY (id));

Phoenix will create this table structure on-top of HBase (automatically creating column families and qualifiers). Phoenix has the concept of dynamic columns, where you are able to upsert columns at runtime - take a look at this documentation

If you'd like to see how Phoenix is using HBase to create column families and column qualifiers, I'd recommend taking a look at the audit log in Ranger, to take a look how column families and column qualifiers are being created.

avatar

Hi, @rudra prasad biswas,

Column families are used to optimize RegionServer memory by separating frequently and infrequently used columns. For example, when using HBase for document storage, document metadata is queried 5-10x for every one read of document content. In this type of application, you would store document content in one column family, and document metadata in another, allowing HBase to avoid loading document content into memory until it is explicitly accessed.

As @Ryan Cicak hinted at, Phoenix tables assign columns to the first column family by default. However, you can specify which family to assign columns to:

CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR)

In this example, COL3 is stored in column family B, and other columns belong to family A.

Assuming there is a subset of column qualifiers you know you will use, they should be defined in the table definition (DDL). However, this is not required. See the Phoenix docs on dynamic columns and views to see how to use Phoenix to support writing arbitrary column qualifiers to a table without pre-defining those columns. Views let you keep track of and expose dynamic columns when needed.

avatar
Rising Star

Hi @Randy Gelhausen, @Ryan Cicak

Thanks for your reply. But my requirement is slightly different. Let me give you an example here.

I have to store the complete journey details when I plan to travel from Mumbai to NY.

Now for each person it will be different. I may want to travel from MUM-DUB-NY, another person could travel via MUM-DEL-LONDON-NY etc.

So for me the column family ('stoppages') will contain 3 qualifier but it's 4 for the other person.

Itinerary1: CF1.stop1= MUM, CF1.stop2=DUB,CF1.stop3=NY

Itinerary1: CF1.stop1= MUM, CF1.stop2=DEL,CF1.stop3=LONDON,CF1.stop4=NY

Now the question is, how would I know for each records how many qualifiers are available, unless I query like select * from <table> where key= <key>.

What if I want to fetch the details of the itineraries whose last stoppage is= NY.

avatar

Hi, @rudra prasad biswas

For this use case, consider a few different approaches:

1. Use a single array type "stoppage" column which stores each stop as a JSON element. To query for records where the last stoppage was NY, use something like "where stoppage[array_length(stoppage)-1] = 'New York".

2. Continue using dynamic columns, but include an additional column "stoppage_count" which you increment for each additional stoppage. You can use stoppage_count to tell your application which dynamic columns to include in the query.

3. Use a relational model where you have records of trips, and another table with "stoppages" linked to the trip ID.

However, the above approaches assume your queries have an initial access pattern limiting the size of the scan.

Assuming your table has primary key of customer_id, trip_id, you DON'T want to run a query like:

select * from trips where stoppages[array_length(stoppages)-1] = 'New York'

because it would be a full scan. Instead, you want it to be something like:

select * from (
  select * from trips where customer_id = '1234'
) a
where stoppages[array_length(stoppages)-1] = 'New York'

which would be a much smaller range scan.