Limitations on # of Hive Columns
Labels: Apache HBase, Apache Hive, Apache Phoenix
Created 12-14-2015 03:08 PM
What is the Hive column count limit? When should you consider moving a table into HBase/Phoenix due to performance issues? Is 100,000 columns too many for Hive to handle?
Thanks,
Created 12-14-2015 03:38 PM
Hey Andrew,
I am not aware of any "hard" limit in Hive on column count, though there are some on column size. That said, a practical limit on column count will also depend on the file format: ORC, with its indexes and predicate pushdown, does not behave the way a plain text file does. ORC also exposes a configuration for how many rows are grouped together per index entry.
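As a hedged sketch of setting that index granularity at table creation (the table and column names are illustrative, not from this thread; 'orc.row.index.stride' is the documented ORC table property, defaulting to 10,000 rows):

```sql
-- Illustrative wide table stored as ORC with an explicit row index stride.
-- A smaller stride means more index entries (finer predicate pushdown)
-- at the cost of more index overhead per stripe.
CREATE TABLE wide_events (id BIGINT, c1 STRING, c2 STRING)
STORED AS ORC
TBLPROPERTIES ('orc.row.index.stride' = '10000');
```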
In Hive issue HIVE-7250 (https://issues.apache.org/jira/browse/HIVE-7250), for example, going past roughly 1,000 columns created memory pressure in the ORC writer, resulting in OOM errors. In the tests there, 15K columns loaded fine and OOM only appeared at 20K columns.
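If I read that JIRA correctly, the fix makes ORC's per-column compression buffer sizing adaptive; on releases without it, a hedged workaround is to shrink the defaults by hand, since the writer allocates a buffer per column. The config names below exist as of Hive 0.13, but verify them against your version:

```sql
-- Reduce the per-column compression buffer size (default 262144 bytes),
-- easing writer memory pressure on very wide ORC tables.
SET hive.exec.orc.default.buffer.size=65536;
-- Smaller stripes also cap how much data is buffered before a flush.
SET hive.exec.orc.default.stripe.size=67108864;
```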
Regarding the HBase/Phoenix tradeoff, I would not base the decision on this metric. HBase and Phoenix really shine on rowkey lookups with a custom-built rowkey; by custom I mean one that encodes some of your search logic. Whether you have 1 column or 1 million, if you force HBase to do a full scan over millions of rows, performance will not shine. In that scenario the granularity of the lookup matters more than the number of columns. Once you have confirmed you can search on rowkey logic, then yes, the column-count/performance tradeoff can be dealt with differently in HBase/Phoenix, using column families, filters, etc. If, however, you were to put a Hive external table on top of it, you would be back in your initial situation.
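As a hedged sketch of what "search logic in the rowkey" can look like in Phoenix (the table, columns, and values below are invented for illustration): the composite primary key becomes the HBase rowkey, so queries that pin its leading columns turn into narrow range scans instead of full scans.

```sql
-- Composite primary key = HBase rowkey: tenant, then day, then event id.
CREATE TABLE IF NOT EXISTS events (
    tenant_id   VARCHAR       NOT NULL,
    event_date  DATE          NOT NULL,
    event_id    UNSIGNED_LONG NOT NULL,
    payload     VARCHAR,
    CONSTRAINT pk PRIMARY KEY (tenant_id, event_date, event_id)
);

-- Served by a rowkey range scan rather than a full table scan,
-- because the WHERE clause pins the leading key columns.
SELECT event_id, payload
FROM events
WHERE tenant_id = 'acme'
  AND event_date = TO_DATE('2015-12-14');
```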
Created 12-14-2015 05:22 PM
The scalability of the metastore service and its backing database will also affect query performance, and as the number of columns per table grows it can limit how reliably queries succeed.
I would recommend checking out the Yahoo presentation on Hive metastore scalability.
Created 12-15-2015 07:18 AM
As @nmaillard said, Hive places a limit on the length of the text it writes into the metastore database. If you look at the call stack you can probably find out where it fails. So the input format is also a key factor in extending the Hive column count.
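Rather than quoting exact limits (they vary by schema version), a hedged way to check them is to ask the metastore database itself. This sketch assumes a MySQL-backed metastore whose schema lives in a database named metastore:

```sql
-- Inspect the length limits the metastore schema imposes on column
-- definitions and table/serde parameters.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'metastore'
  AND TABLE_NAME IN ('COLUMNS_V2', 'TABLE_PARAMS', 'SERDE_PARAMS');
```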
