Hive Warehouse Connector - Not all data is loaded from clustered Hive tables after a delete

Explorer

Hi guys,

 

Recently I noticed that after a DELETE on a clustered Hive table, a SELECT through the Hive Warehouse Connector (HWC) does not return all the data.

 

It seems that some bucket files are ignored after the DELETE.

 

In beeline, all the data continues to be returned correctly.

 

I'm using HDP 3.1.5 with Spark2 2.3.2, Hive 3.1.0 and hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar.

 

See this example:

 

Hive >>

 

CREATE TABLE mdm_dev.test (
  `id_a` string,
  `num_b` string,
  `nom_b` string,
  `dat_c` string,
  `ind_c` string,
  `ind_b` string,
  `id_c` int)
CLUSTERED BY (num_b)
INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO mdm_dev.test (SELECT * FROM mdm_dev.original);

ANALYZE TABLE mdm_dev.test COMPUTE STATISTICS;

ANALYZE TABLE mdm_dev.test COMPUTE STATISTICS FOR COLUMNS;

SELECT COUNT(*) FROM mdm_dev.test; 
180978025

 


Spark >>

 

import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

hive.executeQuery(" SELECT * FROM mdm_dev.test ").count
180978025

 


Up to this point everything is OK. Now, let's DELETE some data.

 

Hive >>

 

SELECT id_c, COUNT(*) FROM mdm_dev.test GROUP BY id_c;
42306 | 953595
...

DELETE FROM mdm_dev.test WHERE id_c = 42306;

SELECT COUNT(*) FROM mdm_dev.test;
180024430

 

 

Spark >>

 

import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

hive.executeQuery(" SELECT * FROM mdm_dev.test ").count
37577751

 


A lot of data is missing from the DataFrame =/
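
To narrow down where the rows disappear, a quick check (a sketch only, using the same session as above) is to run the same COUNT through both HWC paths: executeQuery, which goes through the LLAP external reader, and execute, which goes through HiveServer2 JDBC like beeline does:

import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Count via the LLAP external reader path (the one that seems to skip bucket files)
hive.executeQuery("SELECT COUNT(*) FROM mdm_dev.test").show()

// Count via the HiveServer2 JDBC path, which is what beeline also uses;
// if this matches beeline but the count above does not, the LLAP reader is the piece dropping rows
hive.execute("SELECT COUNT(*) FROM mdm_dev.test").show()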

 

Looking for the missing rows, I noticed that some bucket files were not read at all.
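
One workaround that may be worth trying (an assumption on my part, not something verified in this thread) is a major compaction, so the delete deltas are merged into new base files before HWC reads the table again. The statement below is issued through HWC's executeUpdate, but it can equally be run from beeline:

// Trigger a major compaction; delete deltas are merged into new base files.
// Progress can be checked with SHOW COMPACTIONS in beeline; retry the HWC read once it finishes.
hive.executeUpdate("ALTER TABLE mdm_dev.test COMPACT 'major'")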

1 REPLY

New Contributor

Hi !

 

Did you find any explanation for your issue?

 

I faced a similar problem: HWC's executeQuery did not retrieve the expected rows compared with beeline. My table is defined as ORC with transactional set to true.

NB: executeQuery worked when I "forced" a scan of the whole table by adding a specific WHERE clause on one of the table's columns.
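
A minimal sketch of that workaround, using the table from the original post and an arbitrary column purely for illustration (the predicate is always true; it only serves to force the full scan):

import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Always-true WHERE clause on one of the table's columns; in my case, adding a predicate
// like this was enough to make executeQuery scan everything and match beeline's counts
hive.executeQuery("SELECT * FROM mdm_dev.test WHERE num_b IS NOT NULL OR num_b IS NULL").count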

 

Regards
