Created 12-15-2016 12:48 PM
Can someone help explain the difference between these two Hive analyze commands:
analyze table svcrpt.predictive_customers compute statistics;
analyze table svcrpt.predictive_customers compute statistics for columns;
What more does the "for columns" part do?
Created 12-15-2016 01:02 PM
1. analyze table svcrpt.predictive_customers compute statistics;
This computes the basic table-level stats such as numFiles, numRows, totalSize, and rawDataSize. They are stored in the TABLE_PARAMS table in the Hive metastore database.
2. analyze table svcrpt.predictive_customers compute statistics for columns;
This creates or updates column-level stats such as NUM_DISTINCTS, LOW_VALUE, HIGH_VALUE, and NUM_NULLS in the TAB_COL_STATS table in the metastore database.
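Both sets of stats can also be inspected from Hive itself, without querying the metastore database. A minimal sketch follows; customer_id is a hypothetical column name used only for illustration, and the exact output layout depends on the Hive version:

```sql
-- Table-level view: the "Table Parameters" section should list
-- numFiles, numRows, totalSize and rawDataSize once they are computed.
DESCRIBE FORMATTED svcrpt.predictive_customers;

-- Column-level view: shows min, max, null count and distinct count
-- for one column, if column stats exist for it.
-- customer_id is a placeholder; substitute a real column of the table.
DESCRIBE FORMATTED svcrpt.predictive_customers customer_id;
```

If the second command shows empty stat fields, column stats have probably not been computed for that column yet.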
Created 12-15-2016 01:06 PM
Got it, thanks! Does the "for columns" command also compute the basic stats that the first analyze command does, or do I have to run both commands to get both sets of stats?
Created 12-15-2016 01:18 PM
Computing the column stats will also update the basic stats.
Created 12-15-2016 01:24 PM
Thanks. I just ran my own test to see whether "for columns" also updates the TABLE_PARAMS table, and I found that it does not.
For instance, when I run "analyze table svcrpt.predictive_customers compute statistics;" the transient_lastDdlTime value in the TABLE_PARAMS table gets updated, but when I run "analyze table svcrpt.predictive_customers compute statistics for columns;" transient_lastDdlTime is not updated.
So does this mean "for columns" does not update the basic stats?
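A more direct check than transient_lastDdlTime is to compare the stat values themselves before and after each command. A minimal sketch, assuming a relational (e.g. MySQL-backed) metastore with the usual DBS, TBLS, and TABLE_PARAMS tables; the schema can differ between Hive versions, so treat the table and column names as assumptions to verify:

```sql
-- Run against the metastore database directly, not through Hive.
-- Assumes the standard layout: DBS(DB_ID, NAME), TBLS(TBL_ID, DB_ID, TBL_NAME),
-- TABLE_PARAMS(TBL_ID, PARAM_KEY, PARAM_VALUE).
SELECT p.PARAM_KEY, p.PARAM_VALUE
FROM TABLE_PARAMS p
JOIN TBLS t ON t.TBL_ID = p.TBL_ID
JOIN DBS d ON d.DB_ID = t.DB_ID
WHERE d.NAME = 'svcrpt'
  AND t.TBL_NAME = 'predictive_customers'
  AND p.PARAM_KEY IN ('numFiles', 'numRows', 'totalSize',
                      'rawDataSize', 'transient_lastDdlTime');
```

Running this before and after each analyze command, ideally after loading some new rows in between, shows whether numRows and the other basic stats actually change, which may vary by Hive version.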