Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Loading bucketed table ignores or skips rows

avatar

The following code works unless you change the bucketed column to Id or AccountType. When using those columns the job processes without any errors but some of the rows are dropped or ignored and do not end up in the accounts table. I started out using AccountType and 1.9MM rows were dropped every time I ran the job (20 - 30 times). I changed to Id after troubleshooting and eliminating other issues and found that only 32k rows were dropped. Finally I tried bucketing on Created (timestamp) and all rows were loaded into accounts table. Looking at the data that was skipped I could not see any pattern that would point to a specific bucket file.

DROP TABLE IF EXISTS  ingest.txt_accounts;
CREATE TABLE IF NOT EXISTS ingest.txt_accounts (
    Id BIGINT
    ,Created TIMESTAMP
    ,Modified TIMESTAMP
    ,Status VARCHAR(50)
    ,ActivationDate TIMESTAMP
    ,StatusReason VARCHAR(50)
    ,ApplicationId BIGINT
    ,IsFraud INT
    ,LimitPackage VARCHAR(50)
    ,AccountType VARCHAR(50)
    ,IsEmployeeAccount INT
    ,IsOneTimeFeePaid INT
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");  


LOAD DATA INPATH 'adl://datalake.azuredatalakestore.net/LandingZone/Accounts'  
OVERWRITE INTO TABLE ingest.txt_accounts;  

DROP TABLE IF EXISTS ingest.accounts;
CREATE TABLE IF NOT EXISTS ingest.accounts (
    Id BIGINT
    ,Created TIMESTAMP
    ,Modified TIMESTAMP
    ,Status VARCHAR(50)
    ,ActivationDate TIMESTAMP
    ,StatusReason VARCHAR(50)
    ,ApplicationId BIGINT
    ,IsFraud INT
    ,LimitPackage VARCHAR(50)
    ,AccountType VARCHAR(50)
    ,IsEmployeeAccount INT
    ,IsOneTimeFeePaid INT
) 
CLUSTERED BY (Created) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional" = "true"); 

INSERT INTO TABLE ingest.accounts
SELECT 
     Id
    ,Created 
    ,Modified
    ,Status 
    ,ActivationDate 
    ,StatusReason 
    ,ApplicationId 
    ,IsFraud 
    ,LimitPackage 
    ,AccountType 
    ,IsEmployeeAccount
    ,IsOneTimeFeePaid 
 FROM ingest.txt_accounts;

1 ACCEPTED SOLUTION

avatar

The table statistics are used to provide the count of rows. I was using SELECT COUNT(*) FROM... to validate my loads which was reporting the wrong number of rows. I went through the TEZ logs and saw that my final loads were reporting the correct number of rows in the counters.

Bug: https://issues.apache.org/jira/browse/HIVE-11266

View solution in original post

7 REPLIES 7

avatar
Contributor
@Sean Anderson

This sounds interesting. Could you share the product version you are using, which client is used here to connect to Hive and also the ACID settings you perform on your client with the "SET .." option.

avatar

I'm on Azure HDInsight 3.5.1 which deploys HDP 2.5. I'm using the Spark cluster on Linux VMs.

These are the settings added to the ARM template that I use to deploy the cluster to enable ACID support.

"hive.support.concurrency": true,

"hive.enforce.bucketing": true,

"hive.exec.dynamic.partition.mode": "nonstrict",

"hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",

"hive.compactor.initiator.on": true,

"hive.compactor.worker.threads": 1

I'm not using any SET commands on the client. I use Visual Studio Azure SDK and Ambari to run queries and I have tested with both.

avatar
Contributor

Could you execute - "set hive.enforce.bucketing;" on your client and verify it is set to true. If not explicitly "set hive.enforce.bucketing=true;" on your client and try recreating the scenario again.

avatar

I just tried to reproduce the issue again before trying your suggestion to set hive.enforce.bucketing at the client and it is working as expected. I ran this code all day yesterday and didn't get the expected results once. I have a support ticket in with Microsoft so I will ask them if there were any issues with the system yesterday.

avatar
Super Guru

@Sean Anderson

Not related, but I noticed that you use heavily VARCHAR data type. That is a known performance issue. Try to use STRING instead.

Coming back to your issue, I assume that both columns (ID, AccountType) are not null, otherwise if they have null values, that could explain the difference on number of records dropped.

avatar

I will take your comments about STRING under advisement. None of the values for those columns is ever null for a row.

avatar

The table statistics are used to provide the count of rows. I was using SELECT COUNT(*) FROM... to validate my loads which was reporting the wrong number of rows. I went through the TEZ logs and saw that my final loads were reporting the correct number of rows in the counters.

Bug: https://issues.apache.org/jira/browse/HIVE-11266