Options
- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Solved
Go to solution
Column Based Table Vs Row Based Table
Labels:
- Labels:
-
Apache Hive
New Contributor
Created on ‎05-05-2015 01:07 PM - edited ‎09-16-2022 02:27 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi everyone!
The question is What is better for the performace, we have a table with a PK and a measure variable day by day:
- To have the PK and one column with the measure day by day (365 columns for a given year)
- To have the PK, the date column and one columsn with the measure.
Basically I need to know what would be better for HIVE. An structure based on rows or based on columns.
1 ACCEPTED SOLUTION
Contributor
Created ‎05-07-2015 08:34 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I recommend the 2nd option where you have 3 columns only: (PK, DATE, MEASURE).
You cannot update records on Hive, so having the 365 columns will leave 364 columns unused, and this causes extra storage on your files (like separators chars, schema information, etc).
Also, for read performance, 3 columns is still better than 365. Hive reads the full record every time you do a query, it then selects the columns you want, and applies the filter from the WHERE statement. This select/filter will happen with 3 or 365 columns, so 3 will be faster.
Also, you're queries would be shorter, as you only need to filter the query by date (instead of looking for columns that have measure data). And, if you use columnar storage files (like Parquet), this filter may be faster.
You cannot update records on Hive, so having the 365 columns will leave 364 columns unused, and this causes extra storage on your files (like separators chars, schema information, etc).
Also, for read performance, 3 columns is still better than 365. Hive reads the full record every time you do a query, it then selects the columns you want, and applies the filter from the WHERE statement. This select/filter will happen with 3 or 365 columns, so 3 will be faster.
Also, you're queries would be shorter, as you only need to filter the query by date (instead of looking for columns that have measure data). And, if you use columnar storage files (like Parquet), this filter may be faster.
1 REPLY 1
Contributor
Created ‎05-07-2015 08:34 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I recommend the 2nd option where you have 3 columns only: (PK, DATE, MEASURE).
You cannot update records on Hive, so having the 365 columns will leave 364 columns unused, and this causes extra storage on your files (like separators chars, schema information, etc).
Also, for read performance, 3 columns is still better than 365. Hive reads the full record every time you do a query, it then selects the columns you want, and applies the filter from the WHERE statement. This select/filter will happen with 3 or 365 columns, so 3 will be faster.
Also, you're queries would be shorter, as you only need to filter the query by date (instead of looking for columns that have measure data). And, if you use columnar storage files (like Parquet), this filter may be faster.
You cannot update records on Hive, so having the 365 columns will leave 364 columns unused, and this causes extra storage on your files (like separators chars, schema information, etc).
Also, for read performance, 3 columns is still better than 365. Hive reads the full record every time you do a query, it then selects the columns you want, and applies the filter from the WHERE statement. This select/filter will happen with 3 or 365 columns, so 3 will be faster.
Also, you're queries would be shorter, as you only need to filter the query by date (instead of looking for columns that have measure data). And, if you use columnar storage files (like Parquet), this filter may be faster.
