Created on 01-13-2017 12:25 PM - edited 09-16-2022 03:54 AM
I have many tables in Hive and suspect the size of these tables is causing space issues on the HDFS filesystem. Is there a way to check the size of Hive tables? If so, how?
hive> show tables;
OK
bee_actions
bee_bills
bee_charges
bee_cpc_notifs
bee_customers
bee_interactions
bee_master_03jun2016_to_17oct2016
bee_master_18may2016_to_02jun2016
bee_master_18oct2016_to_21dec2016
bee_master_20160614_021501
bee_master_20160615_010001
bee_master_20160616_010001
bee_master_20160617_010001
bee_master_20160618_010001
bee_master_20160619_010001
bee_master_20160620_010001
bee_master_20160621_010002
bee_master_20160622_010001
bee_master_20160623_010001
bee_master_20160624_065545
bee_master_20160625_010001
bee_master_20160626_010001
bee_master_20160627_010001
bee_master_20160628_010001
bee_master_20160629_010001
bee_master_20160630_010001
bee_master_20160701_010001
bee_master_20160702_010001
bee_master_20160703_010001
bee_master_20160704_010001
bee_master_20160705_010001
bee_master_20160706_010001
bee_master_20160707_010001
bee_master_20160707_040048
bee_master_20160708_010001
bee_master_20160709_010001
bee_master_20160710_010001
bee_master_20160711_010001
bee_master_20160712_010001
bee_master_20160713_010001
bee_master_20160714_010001
bee_master_20160715_010002
bee_master_20160716_010001
bee_master_20160717_010001
bee_master_20160718_010001
bee_master_20160720_010001
bee_master_20160721_010001
bee_master_20160723_010002
bee_master_20160724_010001
bee_master_20160725_010001
bee_master_20160726_010001
bee_master_20160727_010002
bee_master_20160728_010001
bee_master_20160729_010001
bee_master_20160730_010001
bee_master_20160731_010001
bee_master_20160801_010001
bee_master_20160802_010001
bee_master_20160803_010001
Created 01-13-2017 02:07 PM
Thanks so much for your prompt reply.
I ran the suggested command, but I see the size as 0, whereas I know it has some data. So what does that mean?
hive> describe extended bee_master_20170113_010001;
OK
entity_id string
account_id string
bill_cycle string
entity_type string
col1 string
col2 string
col3 string
col4 string
col5 string
col6 string
col7 string
col8 string
col9 string
col10 string
col11 string
col12 string
Detailed Table Information Table(tableName:bee_master_20170113_010001, dbName:default, owner:sagarpa, createTime:1484297904, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:entity_id, type:string, comment:null), FieldSchema(name:account_id, type:string, comment:null), FieldSchema(name:bill_cycle, type:string, comment:null), FieldSchema(name:entity_type, type:string, comment:null), FieldSchema(name:col1, type:string, comment:null), FieldSchema(name:col2, type:string, comment:null), FieldSchema(name:col3, type:string, comment:null), FieldSchema(name:col4, type:string, comment:null), FieldSchema(name:col5, type:string, comment:null), FieldSchema(name:col6, type:string, comment:null), FieldSchema(name:col7, type:string, comment:null), FieldSchema(name:col8, type:string, comment:null), FieldSchema(name:col9, type:string, comment:null), FieldSchema(name:col10, type:string, comment:null), FieldSchema(name:col11, type:string, comment:null), FieldSchema(name:col12, type:string, comment:null)], location:hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/bee_run_20170113_010001, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim= , serialization.format=
Time taken: 0.328 seconds, Fetched: 18 row(s)
hive> describe formatted bee_master_20170113_010001;
OK
# col_name data_type comment
entity_id string
account_id string
bill_cycle string
entity_type string
col1 string
col2 string
col3 string
col4 string
col5 string
col6 string
col7 string
col8 string
col9 string
col10 string
col11 string
col12 string
# Detailed Table Information
Database: default
Owner: sagarpa
CreateTime: Fri Jan 13 02:58:24 CST 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/bee_run_20170113_010001
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL TRUE
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1484297904
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.081 seconds, Fetched: 48 row(s)
hive> describe formatted bee_ppv;
OK
# col_name data_type comment
entity_id string
account_id string
bill_cycle string
ref_event string
amount double
ppv_category string
ppv_order_status string
ppv_order_date timestamp
# Detailed Table Information
Database: default
Owner: sagarpa
CreateTime: Thu Dec 22 12:56:34 CST 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/tables/bee_ppv
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1484340138
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.072 seconds, Fetched: 40 row(s)
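For reference, one way to get Hive to populate those zeroed-out statistics (a sketch using the table name from the output above) is to gather them explicitly. The NOSCAN variant only fills in numFiles and totalSize, so it is quick; the full form also computes numRows and rawDataSize by scanning the data:
hive> ANALYZE TABLE bee_master_20170113_010001 COMPUTE STATISTICS NOSCAN;
hive> ANALYZE TABLE bee_master_20170113_010001 COMPUTE STATISTICS;
hive> describe formatted bee_master_20170113_010001;
After that, totalSize under Table Parameters should reflect what is actually sitting in the table's HDFS directory.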
Created 12-06-2017 05:16 PM
ANALYZE TABLE db_ip2738.ldl_cohort_with_tests COMPUTE STATISTICS
This returns nothing in Hive. However, I ran the hdfs command and got two sizes back. The output looked like this:
hdfs dfs -du -s -h hdfs://hdpprd/data/prod/users/ip2738/ldl_cohort_with_tests
result: 2.9 G 8.8 G hdfs://hdpprd/data/prod/users/ip2738/ldl_cohort_with_tests
Which number is the size of the table?
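As a general note on hdfs dfs -du output (not specific to this cluster): the first column is the raw size of the data and the second is the total disk space consumed across all replicas, so with a replication factor of 3 the second number is roughly three times the first:
hdfs dfs -du -s -h hdfs://hdpprd/data/prod/users/ip2738/ldl_cohort_with_tests
# 2.9 G   8.8 G   ...   -> the table data itself is about 2.9 GB; 8.8 GB is consumed on disk including replication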
Created 01-09-2018 12:00 PM
Since this is an external table (EXTERNAL_TABLE), Hive will not keep any stats on it, because it is assumed that another application is changing the underlying data at will. Why keep stats if we can't trust that the data will be the same in another 5 minutes? For a managed (non-external) table, data is manipulated through Hive SQL statements (LOAD DATA, INSERT, etc.), so Hive knows about any changes to the underlying data and can update the stats accordingly.
Using the HDFS utilities to check the directory file sizes will give you the most accurate answer.
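A minimal sketch of that approach, using paths from the posts above (adjust for your own cluster): look up the table's Location with describe formatted, then point hdfs dfs -du at that directory:
hive> describe formatted bee_ppv;
# note the "Location:" line in the output, then:
hdfs dfs -du -s -h hdfs://cmilcb521.amdocs.com:8020/user/insighte/bee_data/tables/bee_ppv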
Created 01-17-2017 01:40 PM
I got the output. Thanks very much for all your help.
Created 01-17-2017 05:16 PM
448 [GB] hdfs://aewb-analytics-staging-name.example.com:8020/user/hive/warehouse/mybigtable
8 [GB] hdfs://aewb-analytics-staging-name.example.com:8020/user/hive/warehouse/anotherone
0 [GB] hdfs://aewb-analytics-staging-name.example.com:8020/user/hive/warehouse/tinyone
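The exact command behind the listing above isn't shown; one way to produce a similar per-table summary (assuming the default warehouse path, and that the first du column is the size in bytes) is:
hdfs dfs -du -s /user/hive/warehouse/* | awk '{printf "%d [GB] %s\n", int($1/1073741824), $NF}'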