Created 07-09-2018 09:23 AM
We have encountered a behavior in Hive where ANALYZE queries change the locations of table partitions to those of another table.
ANALYZE log:
INFO [HiveServer2-Background-Pool: <thread_id>]: ql.Driver (Driver.java:execute) - Executing command(queryId=<query_id>): ANALYZE TABLE <table_A_name> PARTITION(<partition_info>) COMPUTE STATISTICS .... INFO [HiveServer2-Background-Pool: <thread_id>]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo) - 94048: alter_partitions : db=<db_name> tbl=<table_A_name> INFO [HiveServer2-Background-Pool: <thread_id>]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent) - ugi=<ugi> ip=unknown-ip-addr cmd=alter_partitions : db=<db_name> tbl=<table_A_name> INFO [HiveServer2-Background-Pool: <thread_id>]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions) - New partition values:[<partition_value>] INFO [HiveServer2-Background-Pool: <thread_id>]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions) - New partition values:[<partition_value>] INFO [HiveServer2-Background-Pool: <thread_id>]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions) - New partition values:[<partition_value>]
Results:
hive> select distinct(INPUT__FILE__NAME) from <table_A_name> where <partition_column> BETWEEN <value_A> AND <value_B>; hdfs://.../<db_name>/<table_A_name>/<partition>=<value>/xxxxxx_x ... hdfs://.../<db_name>/<table_A_name>/<partition>=<value>/xxxxxx_x ... hdfs://.../<db_name>/<table_B_name>/<partition>=<value>/xxxxxx_x ... hdfs://.../<db_name>/<table_B_name>/<partition>=<value>/xxxxxx_x
Is this a known issue with Hive? We would appreciate workarounds if any are available.
Hive Version: 2.0.0
Created 07-09-2018 02:58 PM
I'm not sure what causes the problem; I've tried to reproduce you issue without success:
drop table if exists t; create table t (a int) partitioned by (p string); insert into t partition(p=1) values(1); -- runs ok analyze table t partition(p=1) compute statistics; -- throws exception for me right now analyze table t partition(p=2) compute statistics;
Could you please help extend this minimal sample case into a testcase which reproduces your issue? Could you give some more precise information about the hive you are using (full version number)... - if you are using Apache Hive 2.0.0; I would like recommend to try out either 2.3.3 or 3.0.0
Created 07-11-2018 08:15 AM
Thank you for your response.
Could you please help extend this minimal sample caseinto a testcase which reproduces your issue?
We are currently investigating the problem, and so far:
A similar observation can be found here
if you are using Apache Hive 2.0.0; I would like recommend to try out either 2.3.3 or 3.0.0
A version upgrade is not an immediate option because it will take time to implement.
We would like to know whether this is a current issue or has a JIRA ticket?
Created 07-11-2018 11:20 AM
the "alter table" which is happening should be okay; since that call stores some results of the analyze
I don't know about any tickets which records that it could change the location by mistake..
But according to the article you've found; this seems to have happened to other people as well...
All I can think about right now is that something is not fully thread safe.
Is there by any chance multiple analyze statements running at the same time?
Are you using standalone hive-cli or beeline/jdbc with hiveserver2 ?
Created 07-19-2018 08:18 AM
@zaindrich
Apologies for our delayed response.
Are you using standalone hive-cli or beeline/jdbc with hiveserver2
hiveserver2 was used for these queries.
Is there by any chance multiple analyze statements running at the same time
We investigated the logs and found other threads running analyze statements.
One thread was of interest. This was running at the same time as the thread in my original post:
INFO [HiveServer2-Background-Pool: <Thread-2>]: ql.Driver (Driver.java:execute(1515)) - Executing command(queryId=<query_id>): ANALYZE TABLE <table_B_name> PARTITION(<partition_info>) COMPUTE STATISTICS ... INFO [HiveServer2-Background-Pool: <Thread-2>]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(669)) - 94045: get_table : db=<db_name> tbl=<table_B_name> ... INFO [HiveServer2-Background-Pool: <Thread-2>]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(280)) - ugi=<ugi> ip=unknown-ip-addr cmd=get_table : db=<table_B_name> ... INFO [HiveServer2-Background-Pool: <Thread-2>]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(669)) - 94045: alter_partitions : db=<db_name> tbl=<table_A_name> INFO [HiveServer2-Background-Pool: <Thread-2>]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(280)) - ugi=<ugi> ip=unknown-ip-addr cmd=alter_partitions : db=<db_name> tbl=<table_A_name> INFO [HiveServer2-Background-Pool: <Thread-2>]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions(3246)) - New partition values:[<partition_value>] INFO [HiveServer2-Background-Pool: <Thread-2>]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions(3246)) - New partition values:[<partition_value>] INFO [HiveServer2-Background-Pool: <Thread-2>]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions(3246)) - New partition values:[<partition_value>]
In Thread-2, ANALYZE was run on Table B, but alter_partitions was done on Table A.
We had to redact some information but: