Member since: 09-28-2015
Posts: 22
Kudos Received: 5
Solutions: 4
My Accepted Solutions
Views | Posted
---|---
879 | 01-04-2018 09:27 PM
818 | 01-02-2018 05:06 AM
917 | 12-13-2017 10:27 PM
2047 | 11-30-2017 05:38 PM
05-15-2018
06:36 PM
"_col0" is an internal/generated column. If you run a "select count(*) from foo" hive does not have to read any columns from the table, it just needs to count the records. What the explain plan says is that the table scan and select operator first generate empty records from the table. Then the group by counts those and stores that in a generated column "_col0". (There is a second group by because Hive has to aggregate all the results from the different "mappers".) If you just count all rows it's odd to run out of memory. What's the actual query and failure you're seeing?
01-04-2018
09:27 PM
There is some information about data movement here: https://hortonworks.com/blog/writing-a-tez-inputprocessoroutput-2/ Tez is pluggable and has different data transfer paradigms, but in general things are kept in memory until size constraints cause flushes to local disk. When tasks are not on the same node, data is transferred over the network (out-of-band data movement events involving the AM, plus direct data transfer between the nodes).
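If you want to see where the in-memory/spill threshold sits, a hedged sketch of the relevant Tez runtime buffers (verify the property names and defaults for your Tez version):

```sql
-- Sort buffer for ordered (shuffle) output; data spills to local disk once it fills.
SET tez.runtime.io.sort.mb=512;
-- Buffer for unordered output before spilling.
SET tez.runtime.unordered.output.buffer.size-mb=128;
```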
01-02-2018
05:06 AM
Are you sure that you picked up an updated jar when you changed the modifier of the class? Changing the modifier should have worked...
12-18-2017
01:24 AM
When you're joining two tables and the data of both tables is sorted by the join column, Hive can use this to simplify the join algorithm (it will use a merge join). The actual fastest way to perform a join depends on other factors as well, though (table sizes, etc.).
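A rough sketch of what "sorted by the join column" can look like in practice (hypothetical tables; the bucketing and sort-merge settings are commonly used but worth verifying for your Hive version):

```sql
-- Hypothetical tables, bucketed and sorted on the join key.
CREATE TABLE orders (id INT, amount DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

CREATE TABLE customers (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

-- Settings that let Hive pick a sort-merge-bucket (SMB) join.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON (o.id = c.id);
```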
12-15-2017
11:28 PM
If you can control the insert statements that are being run, you might be able to write a multi-insert statement that writes these occurrences to a second table without any additional overhead. (With multi-insert statements you can write rows to multiple tables while processing a single statement.) You can also add counters via the Tez API, but you would have to do so in a UDF, invoke that UDF for every row you're inserting, and then fetch the resulting counters from ATS or the Tez UI. Overall that seems more complex.
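A minimal sketch of the multi-insert pattern (all table and column names are hypothetical):

```sql
-- One scan of the staging table feeds two target tables.
FROM staging s
INSERT INTO TABLE main_table
  SELECT s.id, s.payload
INSERT INTO TABLE flagged_rows
  SELECT s.id, s.payload
  WHERE s.payload LIKE '%suspect%';
```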
12-15-2017
04:19 AM
I couldn't find a document describing all of them. I think the hope of the community was that the names are descriptive enough. Some of them have the same names/behavior as MapReduce, and if you google you'll find some lists that describe those. Are there any in particular that make the least sense?
12-14-2017
10:56 PM
Check the explain plan of both. I believe the planner rewrites the distinct into a group by.
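A quick way to check (table and column names are hypothetical):

```sql
-- Compare the plans; both typically compile to the same group-by aggregation.
EXPLAIN SELECT DISTINCT col1 FROM t;
EXPLAIN SELECT col1 FROM t GROUP BY col1;
```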
12-14-2017
06:48 PM
This should do it:

    insert into <tname> partition (pcol = "01/02/..")
    select <cols> from <tname> where pcol = "01/01/2017"

You could also copy the partition folder and use "alter table ... add partition"; the data files don't contain the partition column.
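A hedged sketch of that second approach (the partition value and location are hypothetical):

```sql
-- After copying the partition directory in HDFS,
-- register the new partition with the metastore.
ALTER TABLE tname ADD PARTITION (pcol = '01/02/2017')
LOCATION '/path/to/copied/partition/dir';
```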
12-13-2017
10:27 PM
For MR you should be able to get the counters from the Yarn UI for the MR job that ran as a result of the query.
12-13-2017
10:23 PM
I think Hive can treat Ignite as a file system, so reads/writes will go through it. You might also want to check the Ignite forums. https://apacheignite.readme.io/v1.3/docs/running-apache-hive-over-ignited-hadoop
12-13-2017
10:11 PM
1 Kudo
I think by default "hive -f <filename>" fails on the first error and won't continue. Errors should end up on stderr, so you can redirect that.
12-13-2017
12:23 AM
Take a look at the explain plan; the join type matters a lot. Stats collection typically helps the CBO pick the right join. A full shuffle join is typically slowest, a distributed hash join is faster, and a broadcast join is typically faster still. If you run into memory issues, either use the distributed hash join or try bucketing.
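Collecting statistics so the optimizer can size the inputs is usually the first step; a minimal sketch with hypothetical tables:

```sql
-- Gather table- and column-level statistics for the CBO.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE customers COMPUTE STATISTICS;
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;

-- Then check which join the planner chose.
EXPLAIN
SELECT s.id, c.name
FROM sales s JOIN customers c ON (s.customer_id = c.id);
```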
12-13-2017
12:17 AM
Hive is better at bulk inserts. "load data", "add partition", even "merge" are all faster because they operate in bulk. Another option is creating an external table over HDFS files and doing a single "insert into <> select ...". Can you use any of those?
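A sketch of the external-table bulk-load pattern (paths, format, and names are hypothetical):

```sql
-- Expose existing HDFS files as an external staging table,
-- then load the target table in one bulk insert.
CREATE EXTERNAL TABLE staging_ext (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/incoming/staging';

INSERT INTO TABLE target_table
SELECT id, payload FROM staging_ext;
```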
12-07-2017
05:59 PM
Does this work for you: https://community.hortonworks.com/articles/131583/how-to-connect-to-hive-using-odbc-in-tableau.html
12-07-2017
05:56 PM
1 Kudo
The general thinking is that external tables are not owned by Hive, and files/folders can change outside of Hive's control. You can't really provide transactional guarantees under those circumstances, which is why the restriction is there. Do you really need to mark those tables external? Can you just convert them to internal?
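Converting an external table to a managed (internal) one is usually a one-line metadata change; a hedged sketch (check the behavior on your Hive version, since the data directory then becomes Hive-managed):

```sql
-- Flip the table from external to managed.
ALTER TABLE my_table SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```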
12-04-2017
11:09 PM
I see. This is going to be inefficient, because Hive can't shuffle the data based on keys. You can probably do it like this:

    select ... from table_1, table_2
    where fuzzy_match(column_x, column_y)          -- inner join branch
    union all
    select ... from table_1
    where not exists (select 1 from table_2
                      where fuzzy_match(column_x, column_y))   -- left outer branch

With https://issues.apache.org/jira/browse/HIVE-14731 you will at least not have single-node cross products, but it will still be an expensive operation.
12-04-2017
06:30 PM
The typical pattern is:

    select * from table_1 left join table_2
      on (lcase(column_x) = lcase(column_y))

Can you produce a canonical format on both sides (like lcase above) and join on that?
11-30-2017
09:30 PM
"Describe extended" should print the constraints: https://issues.apache.org/jira/browse/HIVE-13598
11-30-2017
06:17 PM
I'm not aware of anything like that. However, you might have more luck using explain to understand the individual vertices. With "hive.tez.exec.print.summary=true" you can see a summary of the number of records that flow between vertices. The Tez View has some visualizations of this data.
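For example, a hedged sketch (the query itself is hypothetical):

```sql
-- Print a post-run summary including record counts flowing between vertices.
SET hive.tez.exec.print.summary=true;

-- EXPLAIN shows the vertex/edge structure of a query before it runs.
EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```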
11-30-2017
06:05 PM
I don't think there are any audit tables, but stats are there for tables, partitions and columns. You can get at the statistics information via the describe commands (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe).
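For instance (table and column names are hypothetical):

```sql
-- Table- and partition-level statistics (numRows, rawDataSize, ...)
-- show up in the detailed table information.
DESCRIBE FORMATTED my_table;

-- Column-level statistics for a single column.
DESCRIBE FORMATTED my_table column_a;
```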
11-30-2017
05:38 PM
2 Kudos
> Unknown metadata storage type [derby]

It seems you're using Derby for the Druid catalog. The Druid handler only supports MySQL and Postgres at the moment.
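Roughly, the storage handler reads the Druid metadata connection from Hive's configuration; a hedged sketch using MySQL (these are usually set in hive-site.xml, and the property names are worth double-checking for your distribution):

```sql
-- Point the Druid storage handler at a MySQL metadata store (hypothetical host/credentials).
SET hive.druid.metadata.db.type=mysql;
SET hive.druid.metadata.uri=jdbc:mysql://metadata-host:3306/druid;
SET hive.druid.metadata.username=druid;
SET hive.druid.metadata.password=druid;
```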
11-29-2017
11:06 PM
1 Kudo
Yes. Whether you're bucketing your tables or not, the encoding of the data will be better suited to fast scans with ORC (as opposed to row-major formats, etc.).
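For example (table names are hypothetical):

```sql
-- Columnar ORC storage: only the referenced columns are read,
-- and built-in min/max indexes help skip row groups.
CREATE TABLE events_orc (id BIGINT, event_type STRING, ts TIMESTAMP)
STORED AS ORC;

INSERT INTO TABLE events_orc
SELECT id, event_type, ts FROM events_text;
```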