Member since: 09-28-2015
Posts: 22
Kudos Received: 5
Solutions: 4
My Accepted Solutions
Views | Posted
---|---
879 | 01-04-2018 09:27 PM
818 | 01-02-2018 05:06 AM
917 | 12-13-2017 10:27 PM
2047 | 11-30-2017 05:38 PM
05-15-2018
06:36 PM
"_col0" is an internal/generated column. If you run a "select count(*) from foo" hive does not have to read any columns from the table, it just needs to count the records. What the explain plan says is that the table scan and select operator first generate empty records from the table. Then the group by counts those and stores that in a generated column "_col0". (There is a second group by because Hive has to aggregate all the results from the different "mappers".) If you just count all rows it's odd to run out of memory. What's the actual query and failure you're seeing?
01-04-2018
09:27 PM
There is some information about data movement here: https://hortonworks.com/blog/writing-a-tez-inputprocessoroutput-2/ Tez is pluggable and has different data transfer paradigms, but in general things are kept in memory until size constraints cause flushes to local disk. When tasks are not on the same node, data is transferred over the network (out-of-band data movement events involving the AM, plus direct data transfer between the nodes).
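If you want to see where the in-memory/spill threshold sits, a hedged sketch of the relevant Tez runtime buffers (verify the property names and defaults for your Tez version):

```sql
-- Sort buffer for ordered (shuffle) output; data spills to local disk once it fills.
SET tez.runtime.io.sort.mb=512;
-- Buffer for unordered output before spilling.
SET tez.runtime.unordered.output.buffer.size-mb=128;
```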
01-02-2018
05:06 AM
Are you sure that you picked up an updated jar when you changed the modifier of the class? Changing the modifier should have worked...
12-18-2017
01:24 AM
When you're joining two tables and the data of both tables is sorted by the join column, Hive can use this to simplify the join algorithm (it will use a merge join). The actual fastest way to perform a join depends on other factors as well, though (table sizes, etc.).
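A rough sketch of what "sorted by the join column" can look like in practice (hypothetical tables; the bucketing and sort-merge settings are commonly used but worth verifying for your Hive version):

```sql
-- Hypothetical tables, bucketed and sorted on the join key.
CREATE TABLE orders (id INT, amount DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

CREATE TABLE customers (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

-- Settings that let Hive pick a sort-merge-bucket (SMB) join.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON (o.id = c.id);
```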
12-15-2017
11:28 PM
If you can control the insert statements that are being run, you might be able to write a multi-insert statement that writes these occurrences to a second table without any additional overhead. (With multi-insert statements you can write rows to multiple tables while processing a single statement.) You can also add counters via the Tez API, but you would have to do so in a UDF, invoke that UDF for every row you're inserting, and then fetch the resulting counters from ATS or the Tez UI. Overall that seems more complex.
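A minimal sketch of the multi-insert pattern (all table and column names are hypothetical):

```sql
-- One scan of the staging table feeds two target tables.
FROM staging s
INSERT INTO TABLE main_table
  SELECT s.id, s.payload
INSERT INTO TABLE flagged_rows
  SELECT s.id, s.payload
  WHERE s.payload LIKE '%suspect%';
```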
12-15-2017
04:19 AM
I couldn't find a document describing all of them. I think the hope of the community was that the names are descriptive enough. Some of them have the same names/behavior as MapReduce, and if you google you'll find some lists that describe those. Are there any in particular that make the least sense?
12-14-2017
10:56 PM
Check the explain plan of both. I believe the planner rewrites the distinct into a group by.
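A quick way to check (table and column names are hypothetical):

```sql
-- Compare the plans; both typically compile to the same group-by aggregation.
EXPLAIN SELECT DISTINCT col1 FROM t;
EXPLAIN SELECT col1 FROM t GROUP BY col1;
```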
12-14-2017
06:48 PM
This should do it:

    insert into <tname> partition (pcol = "01/02/..")
    select <cols> from <tname> where pcol = "01/01/2017"

You could also copy the partition folder and use "alter table ... add partition"; the data files don't contain the partition column.
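A hedged sketch of that second approach (the partition value and location are hypothetical):

```sql
-- After copying the partition directory in HDFS,
-- register the new partition with the metastore.
ALTER TABLE tname ADD PARTITION (pcol = '01/02/2017')
LOCATION '/path/to/copied/partition/dir';
```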
12-13-2017
10:27 PM
For MR you should be able to get the counters from the Yarn UI for the MR job that ran as a result of the query.
12-13-2017
10:23 PM
I think Hive can treat Ignite as a file system, so reads/writes will go through it. You might also want to check the Ignite forums. https://apacheignite.readme.io/v1.3/docs/running-apache-hive-over-ignited-hadoop
12-13-2017
10:11 PM
1 Kudo
I think by default "hive -f <filename>" fails on the first error and won't continue. Errors should end up on stderr, so you can redirect that.
12-13-2017
12:23 AM
Take a look at the explain plan; the join type matters a lot. Stats collection typically helps the CBO pick the right join. A full shuffle join is typically slowest, a distributed hash join is faster, and a broadcast join is typically faster still. If you run into memory issues, either use the distributed hash join or try bucketing.
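Collecting statistics so the optimizer can size the inputs is usually the first step; a minimal sketch with hypothetical tables:

```sql
-- Gather table- and column-level statistics for the CBO.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE customers COMPUTE STATISTICS;
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;

-- Then check which join the planner chose.
EXPLAIN
SELECT s.id, c.name
FROM sales s JOIN customers c ON (s.customer_id = c.id);
```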
12-13-2017
12:17 AM
Hive is better at bulk inserts. "load data", "add partition", even "merge" are all faster because they operate in bulk. Another option is creating an external table over HDFS files and doing a single "insert into <> select ...". Can you use any of those?
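A sketch of the external-table bulk-load pattern (paths, format, and names are hypothetical):

```sql
-- Expose existing HDFS files as an external staging table,
-- then load the target table in one bulk insert.
CREATE EXTERNAL TABLE staging_ext (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/incoming/staging';

INSERT INTO TABLE target_table
SELECT id, payload FROM staging_ext;
```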
12-07-2017
05:59 PM
Does this work for you: https://community.hortonworks.com/articles/131583/how-to-connect-to-hive-using-odbc-in-tableau.html
12-07-2017
05:56 PM
1 Kudo
The general thinking is that external tables are not owned by Hive, and files/folders can change outside of Hive's control. You can't really provide transactional guarantees under those circumstances, which is why the restriction is there. Do you really need to mark those tables external? Can you just convert them to internal?
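Converting an external table to a managed (internal) one is usually a one-line metadata change; a hedged sketch (check the behavior on your Hive version, since the data directory then becomes Hive-managed):

```sql
-- Flip the table from external to managed.
ALTER TABLE my_table SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```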
12-04-2017
11:09 PM
I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can probably do it like this:

    select ... from table_1, table_2
    where fuzzy_match(column_x, column_y)          -- inner join branch
    union all
    select ... from table_1
    where not exists (select 1 from table_2
                      where fuzzy_match(column_x, column_y))   -- left outer branch

With https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single-node cross products, but it will still be an expensive operation.
12-04-2017
06:30 PM
The typical pattern is:

    select * from table_1 left join table_2
      on (lcase(column_x) = lcase(column_y))

Can you produce a canonical format on both sides (like lcase above) and join on that?
11-30-2017
09:30 PM
"Describe extended" should print the constraints: https://issues.apache.org/jira/browse/HIVE-13598
11-30-2017
06:17 PM
I'm not aware of anything like that. However, you might have more luck using explain to understand the individual vertices. With "hive.tez.exec.print.summary=true" you can see a summary of the number of records that flow between vertices. The Tez View has some visualizations of this data.
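For example, a hedged sketch (the query itself is hypothetical):

```sql
-- Print a post-run summary including record counts flowing between vertices.
SET hive.tez.exec.print.summary=true;

-- EXPLAIN shows the vertex/edge structure of a query before it runs.
EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```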
11-30-2017
06:05 PM
I don't think there are any audit tables, but stats are there for tables, partitions and columns. You can get at the statistics information via the describe commands (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe).
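For instance (table and column names are hypothetical):

```sql
-- Table- and partition-level statistics (numRows, rawDataSize, ...)
-- show up in the detailed table information.
DESCRIBE FORMATTED my_table;

-- Column-level statistics for a single column.
DESCRIBE FORMATTED my_table column_a;
```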
11-30-2017
05:38 PM
2 Kudos
> Unknown metadata storage type [derby]

It seems you're using Derby for the Druid catalog. The Druid handler only supports MySQL and Postgres at the moment.
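Roughly, the storage handler reads the Druid metadata connection from Hive's configuration; a hedged sketch using MySQL (these are usually set in hive-site.xml, and the property names are worth double-checking for your distribution):

```sql
-- Point the Druid storage handler at a MySQL metadata store (hypothetical host/credentials).
SET hive.druid.metadata.db.type=mysql;
SET hive.druid.metadata.uri=jdbc:mysql://metadata-host:3306/druid;
SET hive.druid.metadata.username=druid;
SET hive.druid.metadata.password=druid;
```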
11-29-2017
11:06 PM
1 Kudo
Yes. Whether you're bucketing your tables or not, the encoding of the data will be better suited to fast scans with ORC (as opposed to row-major formats, etc.).
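For example (table names are hypothetical):

```sql
-- Columnar ORC storage: only the referenced columns are read,
-- and built-in min/max indexes help skip row groups.
CREATE TABLE events_orc (id BIGINT, event_type STRING, ts TIMESTAMP)
STORED AS ORC;

INSERT INTO TABLE events_orc
SELECT id, event_type, ts FROM events_text;
```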