Support Questions

Find answers, ask questions, and share your expertise

Can we check size of Hive tables? If so - how?

Contributor

I have many tables in Hive and suspect the size of these tables is causing space issues on the HDFS filesystem. Is there a way to check the size of Hive tables? If so, how?

 

hive> show tables;
OK
bee_actions
bee_bills
bee_charges
bee_cpc_notifs
bee_customers
bee_interactions
bee_master_03jun2016_to_17oct2016
bee_master_18may2016_to_02jun2016
bee_master_18oct2016_to_21dec2016
bee_master_20160614_021501
bee_master_20160615_010001
bee_master_20160616_010001
bee_master_20160617_010001
bee_master_20160618_010001
bee_master_20160619_010001
bee_master_20160620_010001
bee_master_20160621_010002
bee_master_20160622_010001
bee_master_20160623_010001
bee_master_20160624_065545
bee_master_20160625_010001
bee_master_20160626_010001
bee_master_20160627_010001
bee_master_20160628_010001
bee_master_20160629_010001
bee_master_20160630_010001
bee_master_20160701_010001
bee_master_20160702_010001
bee_master_20160703_010001
bee_master_20160704_010001
bee_master_20160705_010001
bee_master_20160706_010001
bee_master_20160707_010001
bee_master_20160707_040048
bee_master_20160708_010001
bee_master_20160709_010001
bee_master_20160710_010001
bee_master_20160711_010001
bee_master_20160712_010001
bee_master_20160713_010001
bee_master_20160714_010001
bee_master_20160715_010002
bee_master_20160716_010001
bee_master_20160717_010001
bee_master_20160718_010001
bee_master_20160720_010001
bee_master_20160721_010001
bee_master_20160723_010002
bee_master_20160724_010001
bee_master_20160725_010001
bee_master_20160726_010001
bee_master_20160727_010002
bee_master_20160728_010001
bee_master_20160729_010001
bee_master_20160730_010001
bee_master_20160731_010001
bee_master_20160801_010001
bee_master_20160802_010001
bee_master_20160803_010001

1 ACCEPTED SOLUTION

Champion
describe formatted/extended <table> partition <partition spec>

This will output stats like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime.

So it won't tell you outright "this table is X in size," but it would seem that if you include the partition spec it will give you the raw data size.

Otherwise, hdfs dfs -du -s -h /path/to/table will do.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
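For scripting this check, one option is to scrape the `totalSize` table parameter out of the `DESCRIBE FORMATTED` output. A minimal sketch in Python — the sample text below is illustrative, not taken from a real cluster:

```python
import re

def parse_total_size(describe_output: str) -> int:
    """Pull the totalSize table parameter (bytes on disk) out of
    `DESCRIBE FORMATTED <table>` output; return -1 if absent."""
    match = re.search(r"totalSize\s+(\d+)", describe_output)
    return int(match.group(1)) if match else -1

# Illustrative fragment of DESCRIBE FORMATTED output
sample = """\
Table Parameters:
\tnumFiles            \t12
\ttotalSize           \t3276800
\ttransient_lastDdlTime\t1484297904
"""
print(parse_total_size(sample))  # 3276800
```

The same regex approach works on output captured from `hive -e "describe formatted <table>"` or from beeline, as long as stats have actually been computed for the table.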


9 REPLIES

Contributor

Thanks so much for your prompt reply.

I ran the suggested command, but I see the size as 0 whereas I know the table has data in it. So what does that mean?

 

hive> describe extended bee_master_20170113_010001
> ;
OK
entity_id string
account_id string
bill_cycle string
entity_type string
col1 string
col2 string
col3 string
col4 string
col5 string
col6 string
col7 string
col8 string
col9 string
col10 string
col11 string
col12 string

Detailed Table Information Table(tableName:bee_master_20170113_010001, dbName:default, owner:sagarpa, createTime:1484297904, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:entity_id, type:string, comment:null), FieldSchema(name:account_id, type:string, comment:null), FieldSchema(name:bill_cycle, type:string, comment:null), FieldSchema(name:entity_type, type:string, comment:null), FieldSchema(name:col1, type:string, comment:null), FieldSchema(name:col2, type:string, comment:null), FieldSchema(name:col3, type:string, comment:null), FieldSchema(name:col4, type:string, comment:null), FieldSchema(name:col5, type:string, comment:null), FieldSchema(name:col6, type:string, comment:null), FieldSchema(name:col7, type:string, comment:null), FieldSchema(name:col8, type:string, comment:null), FieldSchema(name:col9, type:string, comment:null), FieldSchema(name:col10, type:string, comment:null), FieldSchema(name:col11, type:string, comment:null), FieldSchema(name:col12, type:string, comment:null)], location:hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/bee_run_20170113_010001, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim= , serialization.format=
Time taken: 0.328 seconds, Fetched: 18 row(s)
hive> describe formatted bee_master_20170113_010001
> ;
OK
# col_name data_type comment

entity_id string
account_id string
bill_cycle string
entity_type string
col1 string
col2 string
col3 string
col4 string
col5 string
col6 string
col7 string
col8 string
col9 string
col10 string
col11 string
col12 string

# Detailed Table Information
Database: default
Owner: sagarpa
CreateTime: Fri Jan 13 02:58:24 CST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/bee_run_20170113_010001
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL TRUE
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1484297904

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.081 seconds, Fetched: 48 row(s)
hive> describe formatted bee_ppv;
OK
# col_name data_type comment

entity_id string
account_id string
bill_cycle string
ref_event string
amount double
ppv_category string
ppv_order_status string
ppv_order_date timestamp

# Detailed Table Information
Database: default
Owner: sagarpa
CreateTime: Thu Dec 22 12:56:34 CST 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/tables/bee_ppv
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1484340138

# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.072 seconds, Fetched: 40 row(s)

Expert Contributor
This command should also help you get the size of a Hive table:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];
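The partition spec in that statement is easy to get wrong when scripting over many tables. A small helper, purely illustrative (the table and partition names below are made up), that assembles the statement:

```python
def analyze_stmt(table, partition=None, noscan=False):
    """Build an ANALYZE TABLE ... COMPUTE STATISTICS statement,
    optionally scoped to one partition and/or metadata-only (noscan)."""
    stmt = f"ANALYZE TABLE {table}"
    if partition:
        # partition is a dict like {"bill_cycle": "2016-06"}
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        stmt += f" PARTITION({spec})"
    stmt += " COMPUTE STATISTICS"
    if noscan:
        stmt += " NOSCAN"
    return stmt

print(analyze_stmt("bee_ppv"))
print(analyze_stmt("bee_actions", {"bill_cycle": "2016-06"}, noscan=True))
```

With `noscan`, Hive gathers only the file-level stats (number of files, total size) without reading the data, which is usually what you want for a quick size check.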

Champion
I was wondering whether stats were needed for describe extended to output the actual file size. I recall something like that.

New Contributor

ANALYZE TABLE db_ip2738.ldl_cohort_with_tests COMPUTE STATISTICS 

 

 

This returns nothing in Hive. However, I ran the hdfs command and got two sizes back. The output looked like this:

 

hdfs dfs -du -s -h hdfs://hdpprd/data/prod/users/ip2738/ldl_cohort_with_tests

 

result:  2.9 G     8.8 G       hdfs://hdpprd/data/prod/users/ip2738/ldl_cohort_with_tests

 

Which number is the size of the table?

Expert Contributor

Since this is an external table (EXTERNAL_TABLE), Hive will not keep any stats on the table since it is assumed that another application is changing the underlying data at will.  Why keep stats if we can't trust that the data will be the same in another 5 minutes?  For a managed (non-external) table, data is manipulated through Hive SQL statements (LOAD DATA, INSERT, etc.) so the Hive system will know about any changes to the underlying data and can update the stats accordingly.

 

Using the HDFS utilities to check the directory file sizes will give you the most accurate answer.
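On the two numbers in the `-du` output above: the first column is the logical size of the files (that is the table's data size), and the second is the total disk space consumed across all block replicas — roughly the logical size times the HDFS replication factor. A quick sanity check of that relationship, using the 2.9 G / 8.8 G figures from the earlier post:

```python
def implied_replication(logical_gb, consumed_gb):
    """Ratio of disk space consumed across all replicas to logical
    file size; for a healthy directory this approximates the HDFS
    replication factor."""
    return consumed_gb / logical_gb

# 2.9 GB of data occupying 8.8 GB across replicas implies ~3x replication
print(round(implied_replication(2.9, 8.8)))  # 3
```

So in that example the table holds about 2.9 GB of data, stored three times over for redundancy.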

Champion
What does hdfs dfs -du -s -h /path/to/table output?

Contributor

I got the output. Thanks very much for all your help.

Contributor
448 GB  hdfs://aewb-analytics-staging-name.example.com:8020/user/hive/warehouse/mybigtable
  8 GB  hdfs://aewb-analytics-staging-name.example.com:8020/user/hive/warehouse/anotherone
  0 GB  hdfs://aewb-analytics-staging-name.example.com:8020/user/hive/warehouse/tinyone
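To turn output like the above into a ranked report across a whole warehouse, the `du` lines can be parsed and sorted. A rough sketch, assuming plain `hdfs dfs -du -s` output of the form `<size> <path>` (the namenode and paths below are invented):

```python
def rank_by_size(du_output):
    """Parse '<size> <path>' lines from `hdfs dfs -du -s` into
    (path, size) pairs, largest first."""
    rows = []
    for line in du_output.strip().splitlines():
        size, path = line.split(None, 1)
        rows.append((path.strip(), int(size)))
    return sorted(rows, key=lambda row: -row[1])

sample = """\
448 hdfs://nn.example.com:8020/user/hive/warehouse/mybigtable
8 hdfs://nn.example.com:8020/user/hive/warehouse/anotherone
0 hdfs://nn.example.com:8020/user/hive/warehouse/tinyone"""

for path, size in rank_by_size(sample):
    print(f"{size:>6}  {path}")
```

Feeding this the output of `hdfs dfs -du /user/hive/warehouse` (without `-h`, so sizes stay numeric) gives a quick view of which tables dominate the space.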