Created 03-06-2017 05:02 PM
The following code works unless I change the bucketed column to Id or AccountType. With either of those columns the job completes without errors, but some rows are dropped or ignored and never end up in the accounts table. I started out bucketing on AccountType, and 1.9MM rows were dropped every time I ran the job (20-30 times). After troubleshooting and eliminating other issues I switched to Id, and only 32k rows were dropped. Finally I tried bucketing on Created (a timestamp) and all rows were loaded into the accounts table. Looking at the data that was skipped, I could not see any pattern pointing to a specific bucket file.
DROP TABLE IF EXISTS ingest.txt_accounts;

CREATE TABLE IF NOT EXISTS ingest.txt_accounts (
    Id BIGINT
    ,Created TIMESTAMP
    ,Modified TIMESTAMP
    ,Status VARCHAR(50)
    ,ActivationDate TIMESTAMP
    ,StatusReason VARCHAR(50)
    ,ApplicationId BIGINT
    ,IsFraud INT
    ,LimitPackage VARCHAR(50)
    ,AccountType VARCHAR(50)
    ,IsEmployeeAccount INT
    ,IsOneTimeFeePaid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH 'adl://datalake.azuredatalakestore.net/LandingZone/Accounts'
OVERWRITE INTO TABLE ingest.txt_accounts;

DROP TABLE IF EXISTS ingest.accounts;

CREATE TABLE IF NOT EXISTS ingest.accounts (
    Id BIGINT
    ,Created TIMESTAMP
    ,Modified TIMESTAMP
    ,Status VARCHAR(50)
    ,ActivationDate TIMESTAMP
    ,StatusReason VARCHAR(50)
    ,ApplicationId BIGINT
    ,IsFraud INT
    ,LimitPackage VARCHAR(50)
    ,AccountType VARCHAR(50)
    ,IsEmployeeAccount INT
    ,IsOneTimeFeePaid INT
)
CLUSTERED BY (Created) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional" = "true");

INSERT INTO TABLE ingest.accounts
SELECT
    Id
    ,Created
    ,Modified
    ,Status
    ,ActivationDate
    ,StatusReason
    ,ApplicationId
    ,IsFraud
    ,LimitPackage
    ,AccountType
    ,IsEmployeeAccount
    ,IsOneTimeFeePaid
FROM ingest.txt_accounts;
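For anyone trying to narrow this down, here is a minimal anti-join sketch to list the source rows that never made it into the bucketed table (assuming Id uniquely identifies a row):

-- Source rows with no matching row in the bucketed target table
SELECT t.Id, t.AccountType, t.Created
FROM ingest.txt_accounts t
LEFT JOIN ingest.accounts a ON t.Id = a.Id
WHERE a.Id IS NULL;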
Created 03-06-2017 10:06 PM
This sounds interesting. Could you share the product version you are using, which client you use to connect to Hive, and any ACID settings you apply on the client with the "SET .." option?
Created 03-06-2017 10:49 PM
I'm on Azure HDInsight 3.5.1, which deploys HDP 2.5, using the Spark cluster type on Linux VMs.
These are the settings I add to the ARM template that deploys the cluster in order to enable ACID support:
"hive.support.concurrency": true,
"hive.enforce.bucketing": true,
"hive.exec.dynamic.partition.mode": "nonstrict",
"hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
"hive.compactor.initiator.on": true,
"hive.compactor.worker.threads": 1
I'm not using any SET commands on the client. I use the Visual Studio Azure SDK and Ambari to run queries, and I have tested with both.
Created 03-06-2017 11:54 PM
Could you execute "set hive.enforce.bucketing;" on your client and verify that it is set to true? If not, explicitly run "set hive.enforce.bucketing=true;" on your client and try recreating the scenario again.
Created 03-07-2017 02:21 AM
I just tried to reproduce the issue again before trying your suggestion to set hive.enforce.bucketing on the client, and it is now working as expected. I ran this code all day yesterday and never once got the expected results. I have a support ticket open with Microsoft, so I will ask them whether there were any issues with the platform yesterday.
Created 03-07-2017 01:54 AM
Not related, but I noticed that you use the VARCHAR data type heavily. That is a known performance issue; try using STRING instead.
Coming back to your issue: I assume both columns (Id, AccountType) are never null. If they do contain null values, that could explain the difference in the number of records dropped.
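A quick sketch for ruling that out (table and column names taken from the DDL above):

-- Count null values in each candidate bucketing column of the source table
SELECT
    SUM(CASE WHEN Id IS NULL THEN 1 ELSE 0 END) AS null_id_rows
    ,SUM(CASE WHEN AccountType IS NULL THEN 1 ELSE 0 END) AS null_accounttype_rows
FROM ingest.txt_accounts;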
Created 03-07-2017 01:56 AM
I will take your comment about STRING under advisement. Neither of those columns ever contains a null value.
Created 03-07-2017 10:06 PM
Table statistics were being used to provide the count of rows. I was validating my loads with SELECT COUNT(*) FROM..., which was reporting the wrong number of rows. When I went through the Tez logs, the counters showed that my final loads had actually written the correct number of rows.
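If your build answers COUNT(*) from statistics, here is a sketch of two ways to validate against the actual data instead (hive.compute.query.using.stats is the setting that controls this behavior):

-- Force the count to scan the data rather than read table statistics
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM ingest.accounts;

-- Or refresh the statistics so they match the data before counting
ANALYZE TABLE ingest.accounts COMPUTE STATISTICS;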