Support Questions

Find answers, ask questions, and share your expertise

Limitations on # of Hive Columns

What is the Hive column count limit? When should you consider moving a table into HBase/Phoenix due to performance issues? Is 100,000 columns too many for Hive to handle?

Thanks,

1 ACCEPTED SOLUTION

Hey Andrew

I am not aware of any "hard" limit in Hive on column count; there are limits on column size, though. That said, any practical restriction on column count will also depend on the file format: ORC, with its indexes and predicate pushdown, does not behave the way a text file does. ORC has a configuration for the number of rows that are grouped together under one index entry.
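
For illustration, here is a minimal HiveQL sketch of where that setting lives; the table name is made up and the property values are just the commonly cited defaults, so verify them against your Hive/ORC version:

    -- Hypothetical wide ORC table, shown only to illustrate the properties.
    -- orc.row.index.stride = number of rows grouped under one row-group
    -- index entry (commonly 10000 by default); smaller strides give finer
    -- predicate pushdown at the cost of a bigger index.
    CREATE TABLE wide_events_orc (
      event_id BIGINT,
      col_0001 STRING,
      col_0002 STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.compress'         = 'ZLIB',
      'orc.row.index.stride' = '10000'
    );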

In Hive issue HIVE-7250 (https://issues.apache.org/jira/browse/HIVE-7250), for example, more than 1K columns created memory pressure in ORC that resulted in OOM errors; in their tests, 15K columns loaded successfully and the OOM only appeared at around 20K columns.
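
The memory pressure there comes largely from ORC keeping a compression buffer per column while writing, so one mitigation people use for very wide tables is shrinking that buffer. A rough sketch, reusing the hypothetical table above; the property names and the 32 KB value are assumptions to double-check for your Hive version:

    -- Session-level default for new ORC writers (assumed config name):
    SET hive.exec.orc.default.buffer.size=32768;

    -- Or per table, via the equivalent table property:
    ALTER TABLE wide_events_orc
      SET TBLPROPERTIES ('orc.compress.size' = '32768');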

Regarding the HBase/Phoenix tradeoff, I would not base the decision on this metric. HBase and Phoenix really shine on rowkey lookups with a custom-built rowkey, and by custom I mean one that encodes some of your search logic. Whether you have 1 column or 1 million, if you force HBase to do a full scan over millions of rows, performance will not shine. In that scenario the granular-lookup pattern matters more than the number of columns. Once you have established that you can search on a rowkey, then yes, the column count versus performance tradeoff can be dealt with differently in HBase/Phoenix, using column families, filters, etc. If you were then to put a Hive external table on top of it, however, you would be back in your initial situation.
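
As a purely illustrative sketch of what "search logic in the rowkey" can look like in Phoenix (table and column names are invented), a composite primary key becomes the HBase rowkey, so lookups on its leading columns are served by targeted scans rather than full scans:

    -- Hypothetical Phoenix table: the composite PRIMARY KEY is the rowkey.
    CREATE TABLE IF NOT EXISTS customer_events (
      customer_id VARCHAR NOT NULL,
      event_time  TIMESTAMP NOT NULL,
      event_type  VARCHAR,
      payload     VARCHAR
      CONSTRAINT pk PRIMARY KEY (customer_id, event_time)
    );

    -- Answered by a rowkey range scan, not a full table scan:
    SELECT event_type, payload
    FROM customer_events
    WHERE customer_id = 'C12345'
      AND event_time >= TO_TIMESTAMP('2016-01-01 00:00:00');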

3 REPLIES

Contributor

The scalability of the metastore (Jetty) plus the chosen backing database will also impact query performance, and as the number of columns per table grows it may reduce the success rate of each query execution.
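
To see how column metadata piles up in the backing database, here is a rough sketch assuming the stock metastore schema (TBLS, SDS and COLUMNS_V2 joined through SD_ID and CD_ID); run it directly against the metastore database, not through Hive:

    -- Columns stored per table: a 100,000-column table means 100,000 rows
    -- here that get fetched on every table lookup.
    SELECT t.TBL_NAME, COUNT(*) AS column_count
    FROM TBLS t
    JOIN SDS s ON s.SD_ID = t.SD_ID
    JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
    GROUP BY t.TBL_NAME
    ORDER BY column_count DESC
    LIMIT 20;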

I would recommend checking out the Yahoo presentation on Hive metastore scalability.

http://www.slideshare.net/mythrocks/hadoop-summit-...

Expert Contributor

As @nmaillard said, Hive places a limit on the length of the text in the query that it writes into the database. If you look at the call stack, you can probably find where it fails. So the input format is also a key factor in how far you can push the Hive column count.
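
One place such a length limit can surface is in the width of the metastore's own metadata columns. Assuming a MySQL-backed metastore with the stock schema (where the shipped scripts create fields like COLUMNS_V2.TYPE_NAME and TABLE_PARAMS.PARAM_VALUE as VARCHAR(4000); verify for your version), you can check directly:

    -- Run against the metastore database itself, not through Hive:
    SHOW COLUMNS FROM COLUMNS_V2;
    SHOW COLUMNS FROM TABLE_PARAMS;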