Never hit this until upgrading to CDH 5.7.
I have a very large table with hundreds of billions of records. It's partitioned by YYYY, MM, DD, HOUR. Understandably, a full COMPUTE STATS takes eons to the point of being mostly unusable, so I have always run COMPUTE INCREMENTAL STATS on each HOUR partition after it is added and populated.
But now that errors out with the following error:
ERROR: AnalysisException: Incremental stats size estimate exceeds 200.00MB. Please try COMPUTE STATS instead.
Is there a way to change that limit, or disable the check entirely? Otherwise, how is one supposed to run incremental stats on an extremely large table now?
Sorry you are running into this issue. The check you are hitting is a conservative safety precaution against OOMing in the JVM when serializing an array >1GB. See IMPALA-2648/IMPALA-2664 for details.
We use (#columns * #partitions * 400 bytes) to estimate the in-memory size of the incremental stats, so in the short term the only way to avoid this limit is to reduce the number of columns or partitions so that the estimated in-memory size falls below the 200MB threshold.
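To illustrate the estimate (all numbers here are hypothetical, just to show the arithmetic), a wide table with a few years of hourly partitions overshoots the threshold by a wide margin:

```sql
-- Hypothetical example: 80 columns, ~35,000 hourly partitions (about 4 years),
-- 400 bytes estimated per column per partition:
--   80 * 35,000 * 400 = 1,120,000,000 bytes, i.e. roughly 1,068 MB,
-- far over the 200 MB limit.
SELECT 80 * 35000 * 400 / (1024 * 1024) AS estimated_mb;
```

Dropping column or partition counts enough to get such a table under 200MB is usually impractical, which is why the limit needs to become configurable.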
I agree with you that ideally, we should make the limit configurable, or make changes to dramatically increase it.
Thank you for the information and for filing the ticket. While this ticket is being worked on, are there any workarounds to compute stats for reasonably large tables? Running COMPUTE STATS results in the query hanging without any indication of progress.
This issue is absolutely crippling us. We've had to stop computing stats on virtually all of our tables, because it simply takes far too long to do so on the entire table - and we get this error now for almost all incremental updates as well.
This effectively makes impala unusable at scale.
If there's any potential workaround (sans downgrading the cluster), it would be greatly appreciated.
Sorry this is causing you so much pain. One workaround is to update the table and column stats manually.
The main idea is this: You can cut down the time for computing stats significantly by manually computing and setting the stats for only those columns that actually need them. The columns used in predicates (including join predicates) should have stats. The column stats can be updated relatively infrequently. Setting the column stats manually is a relatively new feature.
The table stats can also be computed and set manually, so e.g., if you've just added a new partition you can do a count(*) on that partition and set the #rows manually.
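To sketch what that workaround looks like (table name, column names, and all numbers below are made up for illustration; the manual-stats ALTER TABLE syntax is recent, so verify it against your Impala version):

```sql
-- All identifiers and values here are hypothetical.

-- 1. Count the rows in a newly added partition:
SELECT COUNT(*) FROM events WHERE yyyy=2017 AND mm=3 AND dd=15 AND hour=8;

-- 2. Set the row count on that partition manually:
ALTER TABLE events PARTITION (yyyy=2017, mm=3, dd=15, hour=8)
  SET TBLPROPERTIES ('numRows'='1234567',
                     'STATS_GENERATED_VIA_STATS_TASK'='true');

-- 3. Set stats manually for a column used in (join) predicates;
--    since these change slowly, this can be done infrequently:
ALTER TABLE events SET COLUMN STATS user_id
  ('numDVs'='50000000', 'numNulls'='0', 'maxSize'='8', 'avgSize'='8');
```

Steps 1 and 2 would be repeated per new partition (easy to script into the same job that populates the partition), while step 3 only needs to run occasionally.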
You can find more info in the docs here:
Just realized the docs do not yet have info on manually setting column stats:
Here's the JIRA that enabled that feature, you can find the syntax in the comments. Feel free to ask questions if you are considering this workaround :)
I'll read up on this - but I would re-echo my prior sentiment: we've taken, IMHO, a huge step backwards here in terms of scalability.
I have hundreds of tables with thousands of partitions.
Compute incremental stats used to work BEAUTIFULLY. All of our code is set to use it, and was exponentially better than trying to do a full compute stats run. Net: We used to push a single button, and everything would just work.
Now, however, we have to look at programmatically computing a bunch of statistics - and then manually updating things... for hundreds of partitions and thousands of tables. Net: We now have to build, and then manually push, tens of thousands of buttons instead of just one. And then maybe it changes again in the next release. This is a huge step backwards IMHO.
Just making that flag a variable or a configuration option at least sounds like a very easy fix. :)
Don't mean to sound like I'm complaining. But we had an amazing, wonderful ice cream cone that was fantastic - and someone just came and slapped it out of our hands. :)
Thank you for your colorful and honest feedback :). You've brought up a legitimate concern around regressed behavior which is causing you pain. Very sorry.
I agree that making the limit configurable is the best course of action in the short term. We'll take a look.
We are facing the same issue with not being able to calculate table stats (we run several large tables). Is upgrading to Impala 2.8 the only fix?
We are running Cloudera 5.9 - will upgrading to Impala 2.8 cause any issues?
I would think calculating table stats on a large table is a common workflow for most clients.
Is it possible to get a patch for this on Impala 2.7?