Member since 09-28-2015
22 Posts
5 Kudos Received
4 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1692 | 01-04-2018 09:27 PM
 | 1428 | 01-02-2018 05:06 AM
 | 1775 | 12-13-2017 10:27 PM
 | 3533 | 11-30-2017 05:38 PM
05-15-2018
06:36 PM
"_col0" is an internal, generated column. If you run "select count(*) from foo", Hive does not have to read any columns from the table; it just needs to count the records. The explain plan says that the table scan and select operator first generate empty records from the table. The group by then counts those and stores the result in a generated column, "_col0". (There is a second group by because Hive has to aggregate the results from the different "mappers".) If you are just counting all rows, it's odd to run out of memory. What's the actual query and failure you're seeing?
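To make the two group-by stages concrete, here is a rough Python sketch of the idea, not Hive's actual operator code. The names `partial_count` and `final_count` and the sample splits are made up for illustration: each "mapper" counts empty records without looking at any column, and a final step sums the partial counts.

```python
from typing import Iterable, List


def partial_count(split: Iterable[object]) -> int:
    """First group by (map side): count records without reading any column."""
    count = 0
    for _ in split:
        count += 1
    return count


def final_count(partials: Iterable[int]) -> int:
    """Second group by (reduce side): merge the partial counts of all mappers."""
    return sum(partials)


# Hypothetical input: three splits of "empty" records, one split per mapper.
splits: List[List[tuple]] = [[()] * 3, [()] * 5, [()] * 2]

partials = [partial_count(split) for split in splits]
total = final_count(partials)  # plays the role of the generated column "_col0"
print(partials, total)
```

Each stage only ever holds a single integer per mapper, which is why a plain row count should not need much memory.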
01-04-2018
09:27 PM
There is some information about data movement here: https://hortonworks.com/blog/writing-a-tez-inputprocessoroutput-2/ Tez is pluggable and supports different data transfer paradigms, but in general data is kept in memory until size constraints force a flush to local disk. When tasks are not on the same node, the data is transferred over the network (out-of-band data movement events involving the AM, plus direct data transfer between the nodes).
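As a loose illustration of "keep in memory until a size limit forces a flush to local disk", here is a minimal Python sketch. It is not how Tez implements its outputs; the class `SpillingBuffer`, its `max_bytes` parameter, and the sample records are all hypothetical.

```python
import os
import tempfile
from typing import List


class SpillingBuffer:
    """Holds records in memory and flushes them to a local file once too large."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes          # hypothetical in-memory size limit
        self.in_memory: List[bytes] = []
        self.in_memory_bytes = 0
        self.spill_files: List[str] = []

    def add(self, record: bytes) -> None:
        self.in_memory.append(record)
        self.in_memory_bytes += len(record)
        if self.in_memory_bytes > self.max_bytes:
            self._spill()

    def _spill(self) -> None:
        # Write everything buffered so far to a temp file, then reset the buffer.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as spill:
            for record in self.in_memory:
                spill.write(record + b"\n")
        self.spill_files.append(path)
        self.in_memory = []
        self.in_memory_bytes = 0

    def cleanup(self) -> None:
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []


buf = SpillingBuffer(max_bytes=10)
for rec in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
    buf.add(rec)

spilled = len(buf.spill_files)   # number of flushes to local disk
remaining = list(buf.in_memory)  # records still held in memory
buf.cleanup()
print(spilled, remaining)
```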
01-02-2018
05:06 AM
Are you sure that you picked up an updated jar when you changed the modifier of the class? Changing the modifier should have worked...
12-14-2017
10:56 PM
Check the explain plan of both. I believe the distinct is re-written to a group-by by the planner.
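The reason such a rewrite is possible is that both forms return the same rows. A small sketch using Python's built-in sqlite3 as a stand-in (SQLite is not Hive, and the table `foo` with its `city` column is made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (city TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?)",
    [("Oslo",), ("Rome",), ("Oslo",), ("Lima",), ("Rome",)],
)

# The two formulations of "unique values of a column".
distinct_rows = sorted(conn.execute("SELECT DISTINCT city FROM foo").fetchall())
group_by_rows = sorted(conn.execute("SELECT city FROM foo GROUP BY city").fetchall())
conn.close()

print(distinct_rows == group_by_rows)
```

This only shows that the results agree; whether the plans are identical in Hive is something the explain output has to confirm.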
12-14-2017
06:48 PM
This: insert into <tname> partition (pcol = "01/02/..") select <cols> from <tname> where pcol = "01/01/2017". You could also copy the partition folder and use "alter table add partition", since the data files don't contain the partition column.
12-13-2017
10:27 PM
For MR you should be able to get the counters from the YARN UI for the MR job that ran as a result of the query.
12-04-2017
11:09 PM
I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can probably do it like this:
select ... from table_1, table_2 where fuzzy_match(column_x, column_y) -- inner join branch
union all
select ... from table_1 where not exists (select 1 from table_2 where fuzzy_match(column_x, column_y)); -- left outer branch
With https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single-node cross products, but it will still be an expensive operation.
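The inner-join-plus-not-exists pattern above can be tried out on a toy dataset. This sketch uses Python's sqlite3 rather than Hive, and the `fuzzy_match` body here (compare after lowercasing and dropping non-alphanumeric characters) is only an assumed placeholder for whatever matching UDF is actually used. It checks that the union of the two branches returns the same rows as a plain left outer join.

```python
import sqlite3


def normalize(value: str) -> str:
    return "".join(ch for ch in value.lower() if ch.isalnum())


def fuzzy_match(left: str, right: str) -> int:
    # Placeholder matcher (an assumption): equal after normalization.
    return int(normalize(left) == normalize(right))


conn = sqlite3.connect(":memory:")
conn.create_function("fuzzy_match", 2, fuzzy_match)
conn.execute("CREATE TABLE table_1 (column_x TEXT)")
conn.execute("CREATE TABLE table_2 (column_y TEXT)")
conn.executemany("INSERT INTO table_1 VALUES (?)",
                 [("Acme Corp.",), ("Globex",), ("Initech",)])
conn.executemany("INSERT INTO table_2 VALUES (?)",
                 [("acme corp",), ("INITECH",), ("Umbrella",)])

emulated = sorted(conn.execute("""
    SELECT t1.column_x, t2.column_y              -- inner join branch
    FROM table_1 t1, table_2 t2
    WHERE fuzzy_match(t1.column_x, t2.column_y)
    UNION ALL
    SELECT t1.column_x, NULL                     -- left outer branch
    FROM table_1 t1
    WHERE NOT EXISTS (SELECT 1 FROM table_2 t2
                      WHERE fuzzy_match(t1.column_x, t2.column_y))
""").fetchall(), key=lambda row: row[0])

native = sorted(conn.execute("""
    SELECT t1.column_x, t2.column_y
    FROM table_1 t1 LEFT JOIN table_2 t2
      ON fuzzy_match(t1.column_x, t2.column_y)
""").fetchall(), key=lambda row: row[0])
conn.close()

print(emulated == native)
```

Both branches still evaluate the match function against every pair of rows, which is where the cost comes from.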
12-04-2017
06:30 PM
The typical pattern is: select * from table_1 left join table_2 on (lcase(column_x) = lcase(column_y)). Can you produce a canonical format on both sides (like lcase above) and join on that?
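Why a canonical format helps: once both sides reduce to the same key, the join becomes a plain equality join that can be done by hashing (or, in a cluster, shuffling) on that key. A minimal Python sketch of the idea, with `canonical` and `left_join_on_canonical` as made-up helper names and lowercasing standing in for lcase:

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def canonical(value: str) -> str:
    """Canonical form used as the join key (here just lowercase, like lcase)."""
    return value.lower()


def left_join_on_canonical(
    left: List[str], right: List[str]
) -> List[Tuple[str, Optional[str]]]:
    # Build a hash index on the canonical key of the right side.
    index: Dict[str, List[str]] = defaultdict(list)
    for y in right:
        index[canonical(y)].append(y)

    joined: List[Tuple[str, Optional[str]]] = []
    for x in left:
        matches = index.get(canonical(x))
        if matches:
            joined.extend((x, y) for y in matches)
        else:
            joined.append((x, None))  # left outer: keep unmatched left rows
    return joined


rows = left_join_on_canonical(["Alice", "BOB", "carol"], ["alice", "Bob", "dave"])
print(rows)
```

Each left row is looked up by key instead of being compared against every right row, unlike the fuzzy-match case.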
11-30-2017
05:38 PM
2 Kudos
> Unknown metadata storage type [derby]

It seems you're using Derby for the Druid catalog. The Druid handler only supports MySQL and Postgres at the moment.