Member since: 12-13-2013
Posts: 39
Kudos Received: 8
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3453 | 10-10-2017 08:05 PM |
12-11-2019
03:50 PM
Updating here myself for others' benefit: I tracked it down to the fact that bloom filters are not supported in dynamic filters when joining to a Kudu table, only min/max filters, which are not much help when joining on many different ids/keys at a time. The relevant work to add support for bloom filters is in KUDU-2483 and IMPALA-3741. Some contributors have expressed interest in picking this up, and I've heard Cloudera may have some people working on it soon, but nothing concrete at the moment.
10-01-2019
01:58 PM
Historically we have kept our daily aggregation tables in Parquet and maintained them by doing INSERT INTO hourly, which means they are not true aggregations, as there are up to 24 entries for a given dim combination in a full day. So we 'compact' them every now and then by re-aggregating entire monthly partitions, which is extremely resource intensive and time consuming. We are trying to switch some of these aggregations to Kudu, fully expecting that we could update existing records in place, but sadly it's not working for us and I'm hoping the community has some ideas. This is what we have tried:

1) UPSERTing with something like this:

UPSERT INTO agg_table
SELECT
agged_facts.dimcol1,
agged_facts.dimcol2,
CAST(agged_facts.some_count + IFNULL(agg_table.some_count, 0) AS INT) some_count
FROM (
SELECT
IFNULL(dimcol1, 1) dimcol1,
IFNULL(dimcol2, '') dimcol2,
CAST(SUM(IFNULL(some_count, 0)) AS INT) some_count
FROM fact_table
WHERE ...
GROUP BY 1,2
) agged_facts
LEFT JOIN agg_table ON
agged_facts.dimcol1 = agg_table.dimcol1 AND
agged_facts.dimcol2 = agg_table.dimcol2

Which works fine initially, but as the table grows, the join becomes slower and slower (and uses more memory). We thought dynamic filtering would fix the issue, so we changed the LEFT JOIN to agg_table to:

LEFT JOIN (
SELECT *
FROM agg_table
WHERE
dimcol1 IN (
SELECT DISTINCT dimcol1
FROM fact_table fact
WHERE ...
) AND
dimcol2 IN (
SELECT DISTINCT dimcol2
FROM fact_table fact
WHERE ...
)
) agg_table ON

But that only helped marginally, because sadly dynamic filtering against Kudu scans uses min/max filters, not bloom filters, so it doesn't filter the scan to the specific values of dimcol1 and dimcol2, but rather to anything in the range between the min and max values seen, which in our case is usually pretty much all of them.

2) We tried to avoid joining altogether:

UPSERT INTO agg_table
SELECT
agged_facts.dimcol1,
agged_facts.dimcol2,
CAST(agged_facts.some_count + IFNULL(agg_table.some_count, 0) AS INT) some_count
FROM (
SELECT
IFNULL(dimcol1, 1) dimcol1,
IFNULL(dimcol2, '') dimcol2,
CAST(SUM(IFNULL(some_count, 0)) AS INT) some_count
FROM fact_table
WHERE ...
GROUP BY 1,2
) agged_facts But that's incorrect syntax: can't sum columns from source and target tables 3) Do updates additively, and compact. Yes, like for Parquet aggs, but big difference is we don’t need to rewrite entire partitions/tablets, just the specific records with multiple entries, so much less resource intensive and can be done more often. To enable this we add a 'version' column so we can have multiple rows for same combination of dimension ids, and a 'delete' column to identify those ready for cleanup, so we'd end up with records like: dimcol1 dimcol2 version to_delete some_count 1 X 1 false 50 1 X 999 false 30 And after compaction aggregation would look like: dimcol1 dimcol2 version to_delete some_count 1 X 1 true 0 1 X 999 false 80 And finally we delete zeroed out records: dimcol1 dimcol2 version to_delete some_count 1 X 999 false 80 This works and we may go with it, BUT we can't figure out a way to make it entirely reliable, because a compaction query failure could update only one of the 2 records above, leaving data incorrect. If we adopt this it's only because we think due to ordering of operations, a failure would affect at most one entry per tablet and in our use case that may be OK.
Labels:
- Apache Impala
- Apache Kudu
07-29-2019
03:46 PM
Thanks very much Tim, I can confirm that it works like a charm, even with the group by, so yeah the docs should be updated because that does add a lot of value vs. what was documented. P.S. I didn't get an email when you first replied, only yesterday with the latest ones. Thanks for the quick response.
07-24-2019
04:06 PM
We have a slow query like:

select max(partition_col_1) from some_table where partition_col_2 = 'x'

It's super slow, scanning all records (hundreds of billions) in the filtered partitions, even though it's not actually getting anything out of them: the select only includes a partitioning column, so there should be no need to read any files at all. Any way or hint to get around this?
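The query option Tim suggested (acknowledged in the reply above) isn't named in this excerpt; my assumption is that it was OPTIMIZE_PARTITION_KEY_SCANS, which lets Impala answer min/max/distinct aggregates on partition key columns from table metadata instead of scanning data files. A minimal sketch of that workaround:

-- Assumption: the suggested fix was this query option (it is not named in the
-- excerpt). It answers min/max/ndv over partition key columns from metadata.
SET OPTIMIZE_PARTITION_KEY_SCANS=true;

SELECT MAX(partition_col_1)
FROM some_table
WHERE partition_col_2 = 'x';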
Tags: impala
Labels:
- Impala
04-18-2019
11:24 AM
1 Kudo
I just wanted to add to Todd's suggestion: also, if you have CM, you can create a new chart with this query:

select total_kudu_on_disk_size_across_kudu_replicas where category=KUDU_TABLE

and it will plot all your table sizes, plus the graph detail will list current values for all entries. Probably not easily scriptable, but at least a way to quickly copy all sizes in one go, looking like this:

7.2T impala::<tablename_redacted> (Kudu)
9.8T impala::<tablename_redacted> (Kudu)
6.5T impala::<tablename_redacted> (Kudu)
4.1G impala::<tablename_redacted> (Kudu)
21.5G impala::<tablename_redacted> (Kudu)
15.2G impala::<tablename_redacted> (Kudu)
6.1T impala::<tablename_redacted> (Kudu)
98G impala::<tablename_redacted> (Kudu)
23.2G impala::<tablename_redacted> (Kudu)
10G impala::<tablename_redacted> (Kudu)
9.1G impala::<tablename_redacted> (Kudu)
1.2T impala::<tablename_redacted> (Kudu)
7.5G impala::<tablename_redacted> (Kudu)
2.6T impala::<tablename_redacted> (Kudu)
35.8T impala::<tablename_redacted> (Kudu)
12-13-2018
04:01 PM
For the Impala, HBase, HDFS and YARN services, I can specify memory allocation in the Static Service Pools UI in Cloudera Manager. Not so for Kudu.
So do I manually under-allocate all the others to leave open whatever I want for Kudu, and then configure Kudu's "Tablet Server Hard Memory Limit"?
CDH 5.15.1 / CM 5.15.1
Labels:
- Cloudera Manager
- Kudu
06-12-2018
04:46 PM
Never mind my last comment: I was confused because the DISABLE_CODEGEN_ROWS_THRESHOLD setting @Tim Armstrong recommended was not yet documented, so I tried using the closest thing I found (SCAN_NODE_CODEGEN_THRESHOLD), which wasn't applicable to our query. It turns out that, even though it's not yet documented, DISABLE_CODEGEN_ROWS_THRESHOLD is available in our CDH 5.13 cluster and works as Tim suggested.
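For anyone finding this later, usage is just a session-level SET before the affected query; the threshold value below is only an example I made up, not a recommendation from this thread.

-- Example value only: codegen is skipped for queries estimated to process
-- fewer rows per node than this threshold, avoiding codegen overhead on
-- small queries while keeping it for large ones.
SET DISABLE_CODEGEN_ROWS_THRESHOLD=50000;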
06-07-2018
10:05 AM
FYI @Tim Armstrong: sadly, setting SCAN_NODE_CODEGEN_THRESHOLD to any value had no effect, perhaps because, as I mentioned above, the slow codegen is NOT in a scan node but in a TOP-N towards the end of processing. We are considering setting DISABLE_CODEGEN=true on the url for this connection alone (specific to user reports), though we'd need to watch carefully to make sure it doesn't make other reports slow. We'll probably also open a case with our EDH support to try to get to the bottom of why it's slow to begin with.
06-06-2018
09:32 AM
Thanks @Tim Armstrong. Hmm, I can't find that option in the current docs, is it just undocumented? Or do you mean SCAN_NODE_CODEGEN_THRESHOLD? Because there is at least 1 scan node (from an often-used dimension that applies to most queries) where the row estimate is 2.6 million (though after filtering only a few rows remain). And even if all scans are under 400K or whatever we set it to, will it help here considering the slow codegen is in a TOP-N step towards the end?

Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
...
03:SCAN HDFS 30 48.332ms 103.898ms 17 2.60M 10.93 MB 192.00 MB irdw_prod.media_dim md
06-05-2018
03:09 PM
Yeah, we definitely wouldn't want to do that globally. We tried doing set DISABLE_CODEGEN=true; right before our sql in the report, but the driver fails with "[Simba][JDBC](11300) A ResultSet was expected but not generated", which is really sad; I had thought we could specify any of these hints right in the sql. Doing so in the jdbc url is not an option because the same connection is shared by all of our thousands of reports, only 10% or so of which are affected by this. @Tim Armstrong I tried to guess your Cloudera email and sent you the profile directly.
06-05-2018
10:04 AM
Thanks, right, I know I can do that, but I'm hoping to figure out the root cause rather than paper over it. Plus it makes me nervous to do so for a whole class of queries/reports... that doc page does say "... Do not otherwise run with this setting turned on, because it results in lower overall performance."
06-04-2018
11:57 AM
We recently enabled hdfs caching for two tables to try and speed up a whole class of queries that are very similar, generally following this pattern:

SELECT x,y,z
FROM (
  SELECT x,y,z FROM table1 WHERE blah
  UNION ALL
  SELECT x,y,z FROM table2 WHERE blah
) x
ORDER BY x DESC, y DESC
LIMIT 20001 OFFSET 0

... but we didn't get much runtime improvement. Digging in, it looks like 80% of the time is spent on CodeGen: 5.25s, of that CompileTime: 1.67s and OptimizationTime: 3.51s (see profile fragment below for this sample run). With set DISABLE_CODEGEN=true the query goes from ~6 seconds to ~1 second, but the docs state this should not be used generally, so I'm hesitant to add that to actual live production reports and would rather understand the root cause. Both tables are parquet, fully hdfs-cached. Both are wide-ish: 253 and 126 cols respectively, but the inner queries project only 20 cols to the outer. CDH 5.13 / Impala 2.10. Happy to send the full profile file by direct mail. Thanks in advance, -mauricio

78:MERGING-EXCHANGE 1 5s307ms 5s307ms 73 101 0 0 UNPARTITIONED
49:TOP-N 30 341.689us 880.634us 73 101 873.00 KB 39.28 KB
00:UNION 30 240.707us 3.190ms 73 1.61K 8.81 MB 0
...
F35:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=0B mem-reservation=0B
PLAN-ROOT SINK
| mem-estimate=0B mem-reservation=0B
|
78:MERGING-EXCHANGE [UNPARTITIONED]
| order by: action_date DESC, action_id ASC
| limit: 101
| mem-estimate=0B mem-reservation=0B
| tuple-ids=47 row-size=398B cardinality=101
|
F34:PLAN FRAGMENT [RANDOM] hosts=18 instances=18
Per-Host Resources: mem-estimate=206.48MB mem-reservation=14.44MB
49:TOP-N [LIMIT=101]
| order by: action_date DESC, action_id ASC
| mem-estimate=39.28KB mem-reservation=0B
| tuple-ids=47 row-size=398B cardinality=101
|
00:UNION
...

>>>>> F34 Fragment for a sample node (all very similar):

Hdfs split stats (<volume id>:<# splits>/<split lengths>): 8:1/38.32 MB
Filter 4 arrival: 5s339ms
AverageThreadTokens: 1.00
BloomFilterBytes: 3.0 MiB
InactiveTotalTime: 0ns
PeakMemoryUsage: 27.4 MiB
PeakReservation: 14.4 MiB
PeakUsedReservation: 0 B
PerHostPeakMemUsage: 58.6 MiB
RowsProduced: 1
TotalNetworkReceiveTime: 261.74us
TotalNetworkSendTime: 313.68us
TotalStorageWaitTime: 4.96us
TotalThreadsInvoluntaryContextSwitches: 583
TotalThreadsTotalWallClockTime: 5.37s
TotalThreadsSysTime: 53ms
TotalThreadsUserTime: 5.20s
TotalThreadsVoluntaryContextSwitches: 169
TotalTime: 5.43s
>> Fragment Instance Lifecycle Timings (0ns)
>> DataStreamSender (dst_id=78) (1ms)
>> CodeGen (5.25s)
   CodegenTime: 26ms
   CompileTime: 1.67s <<<<<<<<<<<<< ????
   InactiveTotalTime: 0ns
   LoadTime: 0ns
   ModuleBitcodeSize: 1.9 MiB
   NumFunctions: 729
   NumInstructions: 35,078
   OptimizationTime: 3.51s <<<<<<<<<<<<< ????
   PeakMemoryUsage: 17.1 MiB
   PrepareTime: 66ms
   TotalTime: 5.25s
>> SORT_NODE (id=49) (94ms)
>> UNION_NODE (id=0) (93ms)
>> HASH_JOIN_NODE (id=48) (9ms)
Labels:
- Impala
05-22-2018
11:05 AM
Can anyone explain what RowBatchQueueGetWaitTime is? I'm looking into a slow-ish query that is taking 2 to 3 seconds to do the hdfs scan on most nodes, and I don't see why it should take that long: 3 or so files per node, only a couple K each, cached (and confirmed all read from cache). The only thing that looks odd is this metric. Here's a sample relevant profile fragment (about the same for all executors):

>>> HDFS_SCAN_NODE (id=0) (1.96s)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 18:1/69.15 KB 20:2/142.83 KB
ExecOption: PARQUET Codegen Enabled, Codegen enabled: 3 out of 3
Runtime filters: Not all filters arrived (arrived: [1], missing [0]), waited for 352ms
Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 12:0% 13:0% 14:0% 15:0%
File Formats: PARQUET/SNAPPY:156
AverageHdfsReadThreadConcurrency: 0.00
AverageScannerThreadConcurrency: 1.00
BytesRead: 228.0 KiB
BytesReadDataNodeCache: 228.0 KiB
BytesReadLocal: 228.0 KiB
BytesReadRemoteUnexpected: 0 B
BytesReadShortCircuit: 228.0 KiB
CachedFileHandlesHitCount: 0
CachedFileHandlesMissCount: 159
DecompressionTime: 188.47us
InactiveTotalTime: 0ns
MaxCompressedTextFileLength: 0 B
NumColumns: 52
NumDictFilteredRowGroups: 0
NumDisksAccessed: 0
NumRowGroups: 3
NumScannerThreadsStarted: 1
NumScannersWithNoReads: 0
NumStatsFilteredRowGroups: 0
PeakMemoryUsage: 499.3 KiB
PerReadThreadRawHdfsThroughput: 0 B/s
RemoteScanRanges: 0
RowBatchQueueGetWaitTime: 1.60s
RowBatchQueuePutWaitTime: 0ns
RowsRead: 426
RowsReturned: 2
RowsReturnedRate: 1 per second
ScanRangesComplete: 3
ScannerThreadsInvoluntaryContextSwitches: 8
ScannerThreadsTotalWallClockTime: 1.89s
MaterializeTupleTime(*): 16ms
ScannerThreadsSysTime: 10ms
ScannerThreadsUserTime: 73ms
ScannerThreadsVoluntaryContextSwitches: 393
TotalRawHdfsReadTime(*): 0ns
TotalReadThroughput: 88.4 KiB/s
TotalTime: 1.96s
>>> Filter 0 (1.00 MB) (0ns)
InactiveTotalTime: 0ns
Rows processed: 0
Rows rejected: 0
Rows total: 426
TotalTime: 0ns
>>> Filter 1 (1.00 MB) (0ns)
InactiveTotalTime: 0ns
Rows processed: 426
Rows rejected: 424
Rows total: 426
TotalTime: 0ns

Thanks in advance! -m
Labels:
- Impala
03-27-2018
01:57 PM
Thanks very much Tim for looking up the JIRA. Yikes, it's been open since 2014. As John pointed out there, the column order info must be in the metastore, since Hive's show create table displays fine, so it seems like this should be a simple change to how Impala reads that info. Upvoted the JIRA.
03-26-2018
07:21 PM
When I create an impala/hive table over an hbase table, the columns in Impala appear in alphabetical order instead of as defined. Not a blocker, but really annoying, and it might become an issue down the road. Anyone know what could be happening? We're on CDH 5.13, thanks.

Defined in hbase with:

create 'irdw_sandbox:date_dim', {NAME => 'mcf', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'true', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'PREFIX_TREE', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}

Defined in Hive with:

CREATE EXTERNAL TABLE irdw_sandbox.hbase_date_dim (
key STRING,
id INT,
sqldate TIMESTAMP,
year INT,
quarter_of_year INT,
month_of_year INT,
week_of_year INT,
day_of_year INT,
day_name STRING,
day_of_week INT,
day_of_month INT,
day_type STRING,
month_name STRING,
week_of_year_mtos INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos"
)
TBLPROPERTIES("hbase.table.name" = "irdw_sandbox:date_dim") and a show create table in hive looks fine: CREATE EXTERNAL TABLE `hbase_date_dim`(
`key` string COMMENT 'from deserializer',
`id` int COMMENT 'from deserializer',
`sqldate` timestamp COMMENT 'from deserializer',
`year` int COMMENT 'from deserializer',
`quarter_of_year` int COMMENT 'from deserializer',
`month_of_year` int COMMENT 'from deserializer',
`week_of_year` int COMMENT 'from deserializer',
`day_of_year` int COMMENT 'from deserializer',
`day_name` string COMMENT 'from deserializer',
`day_of_week` int COMMENT 'from deserializer',
`day_of_month` int COMMENT 'from deserializer',
`day_type` string COMMENT 'from deserializer',
`month_name` string COMMENT 'from deserializer',
`week_of_year_mtos` int COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping'=':key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos',
'serialization.format'='1')
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'hbase.table.name'='irdw_sandbox:date_dim',
'numFiles'='0',
'numRows'='-1',
'rawDataSize'='-1',
'totalSize'='0',

but a show create table in impala (after invalidate metadata to recognize the new table) doesn't:

CREATE EXTERNAL TABLE irdw_sandbox.hbase_date_dim (
key STRING,
day_name STRING,
day_of_month INT,
day_of_week INT,
day_of_year INT,
day_type STRING,
id INT,
month_name STRING,
month_of_year INT,
quarter_of_year INT,
sqldate TIMESTAMP,
week_of_year INT,
week_of_year_mtos INT,
year INT
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping'=':key,mcf:id,mcf:sqldate,mcf:year,mcf:quarter_of_year,mcf:month_of_year,mcf:week_of_year,mcf:day_of_year,mcf:day_name,mcf:day_of_week,mcf:day_of_month,mcf:day_type,mcf:month_name,mcf:week_of_year_mtos', 'serialization.format'='1')
TBLPROPERTIES ('COLUMN_STATS_ACCURATE'='false', 'hbase.table.name'='irdw_sandbox:date_dim', 'numFiles'='0', 'numRows'='-1', 'rawDataSize'='-1', 'storage_handler'='org.apache.hadoop.hive.hbase.HBaseStorageHandler', 'totalSize'='0')
Labels:
- HBase
01-01-2018
09:14 PM
So this query hangs the daemon but other queries run fine? Oh wow, that's strange; I would expect it to fail with an out-of-memory error or something. Do you have really low memory on the daemon, like under 16GB? You should review the daemon INFO log under /var/log/impalad and search for the query id to see how it progresses and where it gets stuck (you can compare the entries vs. another query that runs fine). You'll see a lot of info about fragments being set up and distributed.
12-31-2017
05:38 PM
Well, if you can't access the impala UI on that node then you have bigger problems than that query. Perhaps your impalad is hung? Or maybe you have a firewall or network policy that is not allowing you to access that port? Could you first of all try restarting that impalad?
12-31-2017
05:33 PM
2 Kudos
For us it didn't appear to be any particular table having too many files or partitions, but rather the catalog tracking too many of them overall. So definitely compact the most fragmented ones to start with, but the goal is to lower total files. We use impala itself, doing an insert overwrite in place. This does result in a short outage, as queries on that table will fail for a few seconds (if reading the same partitions being overwritten), so we schedule this late at night. For a typical table partitioned by event_date_yearmonth and account_id_mod (i.e. account_id % 10), we typically compact the current month (which has data coming in throughout the day, hence many new small files) with:

insert overwrite sometable partition(event_date_yearmonth, account_id_mod)
select * from sometable where event_date_yearmonth = '201712'

This will result in 1 file in each partition (or more if the partition's data is bigger than the block size), and all account_id_mod partitions for event_date_yearmonth 201712 will be rewritten, while other months will not be touched. Notice I didn't specify partition values in the partition clause, so it's fully dynamic, and therefore the * works (* returns all the schema columns AND the partition columns). Caution though: having 1 file per partition will decrease parallelism and query performance if the common use case is to read a single partition at a time. If so, you can set PARQUET_FILE_SIZE before the insert to create files in each partition smaller than the default 128m.
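For example, if you would rather have each partition split into a few files than a single one, something like this before the insert; the 64m value is just an illustration, not what we actually use.

-- Illustration only: cap parquet output file size so each partition gets
-- several smaller files instead of one, preserving scan parallelism.
SET PARQUET_FILE_SIZE=64m;

INSERT OVERWRITE sometable PARTITION (event_date_yearmonth, account_id_mod)
SELECT * FROM sometable
WHERE event_date_yearmonth = '201712';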
12-30-2017
05:38 PM
How are you submitting your query? If through impala-shell, then you should see something like "Query progress can be monitored at: http://[coordinator_hostname]:25000/query_plan?query_id=984ed18511f4ae82:9ccc11c300000000" and you can go there to see its progress. Or you can start impala-shell with --live_summary and see the progress of each fragment in realtime. If through odbc/jdbc, and you're specifying a node directly (not through haproxy), then you can go directly to http://[coordinator_hostname]:25000/queries for that node and see any queries running there, even if for some reason they're not coming up in CM. https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_live_summary.html#live_summary
12-30-2017
03:48 PM
2 Kudos
We had this issue too (on different versions, most recently with 5.10) and tried many things over a 6 month period:

- Fiddled with catalog server config, such as -num_metadata_loading_threads=32 in the advanced config snippet.
- Increased catalog memory to 24GB I think, but since it wasn't helpful we ended up going back to 16GB.
- Refactored many of our jobs to drastically lower the number of refreshes and invalidate metadatas.
- Went from doing a daily compute stats on our tables to weekly on Saturdays.

All was pretty much to no avail. Then we noticed some tables were not being defragmented by our maintenance jobs and had upwards of 100K files (each!). We fixed that and started compacting others more aggressively, so our tables went from having over 2 million total files to about 1.4 million. That did the trick. No more long catalog metadata operations. Hope that helps.
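If it helps, a quick way to check how fragmented a given table is (the table name is just a placeholder):

-- Placeholder table name: the #Files and Size columns per partition make the
-- most fragmented partitions easy to spot before compacting.
SHOW TABLE STATS sometable;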
11-14-2017
11:26 AM
We often need to restart a node to do some quick maintenance, such as reconfiguring a disk or changing an OS setting that requires a machine restart. Also, we use Impala not only for interactive user queries but also for many of our ETL job queries, and these slightly longer queries of course die if a single node processing their fragments becomes unavailable, killing the corresponding job and exposing us to data corruption. Therefore we are always forced to pause all our jobs. I know we can gracefully decommission a node, but it can take hours to move all the dfs data out and then back in, so it's not worth it when trying to do a quick restart. So is there a way (via CM, shell or API) to tell an impalad to simply stop taking new fragments in preparation for a restart? (We can also easily remove it from haproxy so it doesn't take new queries as coordinator.) Thanks!
Labels:
- Cloudera Manager
- Impala
10-10-2017
08:10 PM
Another option I forgot to mention: if your table is partitioned and your insert query uses dynamic partitioning, it will generate 1 file per partition:

insert into table2 partition(par1, par2)
select col1, col2 .. colN, par1, par2 from table1;

... again up to the max parquet file size currently set, so you can play with that max to achieve 2 files per partition. https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_partitioning.html#partition_static_dynamic
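So for roughly two files per partition you could pair that with a lower file size cap before the insert; the 100m value below is only a guess to start the trial and error from.

-- Guessed value: if a partition currently lands ~200 MB in one file, capping
-- output files at ~100 MB should produce about two files per partition.
SET PARQUET_FILE_SIZE=100m;

INSERT INTO table2 PARTITION (par1, par2)
SELECT col1, col2 /* ...remaining columns... */, par1, par2 FROM table1;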
10-10-2017
08:05 PM
You can do "set NUM_NODES=1" in your session (before your query), which will cause it to be processed in a single node (just in the coordinator). It will produce 1 file, up to the default max size of parquet files. You can do "set PARQUET_FILE_SIZE=XX" to fine-tune that max file size up or down until you get it split exactly into 2 files (it will take some trial and error because this is an upper bound - files are actually quite a bit smaller than the limit in my experience). But beware the docs state NUM_NODES is not for production use, especially on big tables, as it can put a lot of pressure on a single host and crash that impalad. https://www.cloudera.com/documentation/enterprise/latest/topics/impala_query_options.html -m
09-19-2017
05:48 PM
I see no way to specify a compression default in the create table statement, so I tried:

SET COMPRESSION_CODEC=gzip;
insert overwrite <text_table> select .. from <another_table>

and got "Writing to compressed text table is not supported. Use query option ALLOW_UNSUPPORTED_FORMATS to override." But ALLOW_UNSUPPORTED_FORMATS shouldn't be used according to the docs. Is there a trick to having impala write text files compressed? -Mauricio
Labels:
- Impala
04-14-2017
11:01 AM
1 Kudo
We spent almost 2 weeks diagnosing impalad and hbase RS crashes, and finally were able to figure out it was a spark job that had been changed and was failing sometimes with excessive GC issues. The failing executors/containers were using 2,000% cpu instead of 100% or less (the spark job was launched with 1 core per executor). The box cpu usage was through the roof, even though CM didn't pick it up in the per-role graph (which has lines for NodeManager cpu_system_rate and cpu_user_rate, but doesn't pick up containers, I guess?).

Before the container died we were able to capture output which I guess means yarn killed it because of the excessive GC (or the jvm killed it? it was started with -XX:OnOutOfMemoryError=kill). top showed > 2000% cpu on a java process, which we could see was the container for this executor, launched specifically with the --cores 1 argument.

We're on CDH 5.10 and are using both static and dynamic pools. Cgroup I/O Weight for the YARN NodeManager is 35%. Container Virtual CPU Cores is set to 12 (out of 32 cores total). For dynamic pools, all pools are using the DRF scheduler, and the pool this job runs in has 130 cores out of 336 total. So is this total breakdown of the resource allocation model a yarn bug, or a limitation that one must be aware of and watch carefully?
11-15-2016
07:22 PM
I've opened IMPALA-4492. Thanks Tim. -m
11-08-2016
11:04 AM
Thanks Tim, yeah even just having the real-time start timestamp for each fragment/instance would be very helpful to isolate consistently slow worker nodes. Should I open a JIRA for that? Quick follow-up question (last one, promise): here's another weird instance of that same query, which happens often: a relatively small (by comparison) hdfs scan of 1MB and 86K rows, all local, all from cache, still took 8 seconds vs. the normal ~400ms on all other nodes. Since disk is not a factor here, what else could be holding up the read from cache? The cluster is barely loaded at 20% cpu.
11-04-2016
08:31 PM
I'm not sure where I go from here with this insight though... there are no actual timestamps in the fragments, only individual timings, so I have no idea how to find out which ones on which nodes started late. Any ideas on that, anyone? Thanks, -m
11-04-2016
08:10 PM
Thanks so much Tim. I compared summaries out of profiles from a fast and a slow run of (basically) the same query. They are pretty much the same! However, I did notice in the timeline that the slow query took 6 seconds between 'ready to start fragments' and 'all fragments started'. So I guess there were some straggler nodes, but outside of fragment processing, because none of the 'Max Time' values look high:

FAST:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
----------------------------------------------------------------------------------------------------------------------------------
20:MERGING-EXCHANGE 1 99.260us 99.260us 7 10 0 -1.00 B UNPARTITIONED
10:SORT 22 677.665us 828.653us 7 10 24.02 MB 16.00 MB
19:AGGREGATE 22 201.337ms 244.372ms 7 10 2.28 MB 10.00 MB FINALIZE
18:EXCHANGE 22 20.432us 140.472us 28 10 0 0 HASH(CASE WHEN cld.platform...
09:AGGREGATE 22 219.170ms 287.685ms 28 10 1.98 MB 10.00 MB STREAMING
08:HASH JOIN 22 7.359ms 71.569ms 153.93K 851 2.04 MB 761.00 B INNER JOIN, PARTITIONED
|--17:EXCHANGE 22 7.094us 20.544us 7 761 0 0 HASH(dd.id)
| 04:SCAN HDFS 1 8.870ms 8.870ms 7 761 175.15 KB 32.00 MB irdw_prod.date_dim dd
16:EXCHANGE 22 1.589ms 8.964ms 183.28K 851 0 0 HASH(event_date_local_dim_id)
07:HASH JOIN 22 200.085ms 267.669ms 183.28K 851 12.04 MB 1.71 MB INNER JOIN, PARTITIONED
|--15:EXCHANGE 22 1.728ms 2.233ms 1.43M 1.43M 0 0 HASH(cld.id)
| 03:SCAN HDFS 2 11.642ms 12.940ms 1.43M 1.43M 11.59 MB 64.00 MB irdw_prod.client_dim cld
14:EXCHANGE 22 1.203ms 2.218ms 183.28K 851 0 0 HASH(fact.client_dim_id)
06:HASH JOIN 22 204.626ms 256.269ms 183.28K 851 6.03 MB 428.48 KB INNER JOIN, PARTITIONED
|--13:EXCHANGE 22 2.719ms 11.037ms 2.19M 2.19M 0 0 HASH(md.id)
| 02:SCAN HDFS 8 4.838ms 6.003ms 2.19M 2.19M 6.11 MB 40.00 MB irdw_prod.media_dim md
12:EXCHANGE 22 825.834us 4.673ms 183.28K 851 0 0 HASH(fact.media_dim_id)
05:HASH JOIN 22 199.360ms 253.233ms 183.28K 851 2.02 MB 18.00 B INNER JOIN, BROADCAST
|--11:EXCHANGE 22 8.630us 10.408us 3 2 0 0 BROADCAST
| 01:SCAN HDFS 1 23.969ms 23.969ms 3 2 181.51 KB 32.00 MB irdw_prod.campaign_dim cd
00:SCAN HDFS 22 814.857ms 1s106ms 183.28K 2.86M 3.18 MB 320.00 MB irdw_prod.agg_daily_activit...
Ready to start 122 remote fragments: 24,632,776
All 122 remote fragments started: 40,539,024
First dynamic filter received: 523,467,712
Rows available: 1,742,258,728
SLOW:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
---------------------------------------------------------------------------------------------------------------------------------
20:MERGING-EXCHANGE 1 115.304us 115.304us 4 10 0 -1.00 B UNPARTITIONED
10:SORT 22 711.942us 971.640us 4 10 24.02 MB 16.00 MB
19:AGGREGATE 22 224.158ms 356.244ms 4 10 2.28 MB 10.00 MB FINALIZE
18:EXCHANGE 22 17.424us 116.992us 24 10 0 0 HASH(CASE WHEN cld.platform...
09:AGGREGATE 22 239.932ms 376.739ms 24 10 1.98 MB 10.00 MB STREAMING
08:HASH JOIN 22 9.258ms 126.508ms 7.00K 1.27K 2.04 MB 761.00 B INNER JOIN, PARTITIONED
|--17:EXCHANGE 22 6.449us 14.136us 7 761 0 0 HASH(dd.id)
| 04:SCAN HDFS 1 31.094ms 31.094ms 7 761 175.15 KB 32.00 MB irdw_prod.date_dim dd
16:EXCHANGE 22 313.646us 762.564us 24.74K 1.27K 0 0 HASH(event_date_local_dim_id)
07:HASH JOIN 22 222.134ms 336.441ms 24.74K 1.27K 12.04 MB 1.71 MB INNER JOIN, PARTITIONED
|--15:EXCHANGE 22 2.364ms 3.331ms 1.43M 1.43M 0 0 HASH(cld.id)
| 03:SCAN HDFS 2 17.363ms 21.651ms 1.43M 1.43M 11.66 MB 64.00 MB irdw_prod.client_dim cld
14:EXCHANGE 22 319.401us 541.207us 24.74K 1.27K 0 0 HASH(fact.client_dim_id)
06:HASH JOIN 22 238.946ms 399.160ms 24.74K 1.27K 6.03 MB 428.04 KB INNER JOIN, PARTITIONED
|--13:EXCHANGE 22 2.509ms 3.938ms 2.19M 2.19M 0 0 HASH(md.id)
| 02:SCAN HDFS 7 14.627ms 28.996ms 2.19M 2.19M 3.27 MB 48.00 MB irdw_prod.media_dim md
12:EXCHANGE 22 265.672us 600.188us 24.74K 1.27K 0 0 HASH(fact.media_dim_id)
05:HASH JOIN 22 220.025ms 363.591ms 24.74K 1.27K 2.02 MB 18.00 B INNER JOIN, BROADCAST
|--11:EXCHANGE 22 12.656us 17.408us 2 2 0 0 BROADCAST
| 01:SCAN HDFS 1 10.060ms 10.060ms 2 2 181.48 KB 32.00 MB irdw_prod.campaign_dim cd
00:SCAN HDFS 22 551.595ms 1s062ms 24.74K 4.26M 2.59 MB 320.00 MB irdw_prod.agg_daily_activit...
Ready to start 121 remote fragments: 36,909,268
All 121 remote fragments started: 6,567,144,968
First dynamic filter received: 6,567,170,788
Rows available: 8,395,137,540