We are using CDH 5.5.1, we have some incrmental data ingestion happening on hourly basis. After incrmental ingestion is done, we usually run compute stats command on our tables, to optimize queries for front end. But we are noticing it is taking more than hour to complete compute stats on 1 table which has data approx.....of 500 GB.
Is there anyway we can make compute stats or compute incrmental stats run faster or quicker ?
I recommend that you consider updating the column stats (#distinct values per column), and the table stats (table/partition row counts) separately. Compute stats is mostly expensive due to computing column stats, but the number of distinct values typically changes much slower than the row count.
What you can do is run the full compute stats less frequently (e.g., once your table size has doubled).
You can update the table stats (row counts) in a much cheaper way by running select count(*) and using ALTER TABLE to manually set the new row count.