Created 02-14-2017 06:48 AM
I have a query(ex: below) for which I want to get the statistics like how many rows it results, what is the size of a row.
SELECT t1.c1 ,t1.c2 FROM t1 WHERE NOT (t1.c1 IS NULL );
Is there any way to get it?
I tried "explain extended" as below, which is not printing the details that I want.
Explain Extended SELECT t1.c1 ,t1.c2 FROM t1 WHERE NOT (t1.c1 IS NULL );
Any help is appreciated.
Created 02-14-2017 07:13 AM
You need to use "describe extended <table_name>;" as below:
describe extended tableex5 ;
| Detailed Table Information | Table(tableName:tableex5, dbName:default, owner:hive, createTime:1487032307, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col1, type:string, comment:null), FieldSchema(name:col2, type:int, comment:null)], location:hdfs://ssnode253stats.openstacklocal:8020/apps/hive/warehouse/tableex5, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe, parameters:{serialization.format=1, field.delim=%|%}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=24, numRows=2, rawDataSize=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=1, transient_lastDdlTime=1487032937}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) |
Created 02-14-2017 07:17 AM
Thanks for your reply.
I need these details on a query rather than on a table. Hope you got my point.
Created 02-14-2017 08:27 AM
Hi @Srikanth Puli, you cannot get this information before running the your query. Statistics can return only general information about the table, like how many records, total size in bytes, etc. How can anyone know how many non-null c1 fields are there without running the query? And how about non-null c1 and non-null c2 and ... many other conditions? So, please run the query to find out: SELECT COUNT(*), MAX(LENGTH(t1.c1)) FROM t1 WHERE t1.c1 IS NOT NULL; provided that c1 is string.
Created 02-14-2017 03:49 PM
Please, have a look at this: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables–ANALYZE
For a non-partitioned table, you can issue the command:
ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;
to gather column statistics of the table (Hive 0.10.0 and later).
If Table1 is a partitioned table, then for basic statistics you have to specify partition specifications like above in the analyze statement. Otherwise a semantic analyzer exception will be thrown.
However for column statistics, if no partition specification is given in the analyze statement, statistics for all partitions are computed.
You can view the stored statistics by issuing the DESCRIBE command. Statistics are stored in the Parameters array. Suppose you issue the analyze command for the whole table Table1, then issue the command:
DESCRIBE EXTENDED TABLE1;
then among the output, the following would be displayed:
... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....
I hope this helps.
Created 03-02-2017 11:09 PM
For getting the number of rows in result set you could use this query
Select count(*)
from
(SELECT t1.c1 ,t1.c2 FROM t1 WHERE NOT (t1.c1 IS NULL )) AS table_1;