Member since 03-06-2017
5 Posts
0 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3130 | 03-07-2017 10:06 PM
03-07-2017 10:06 PM
It turns out Hive was answering my SELECT COUNT(*) FROM ... from table statistics instead of scanning the data, so the validation query I was using to check my loads reported the wrong number of rows. I went through the Tez logs and saw that the counters for my final loads were reporting the correct number of rows. Bug: https://issues.apache.org/jira/browse/HIVE-11266
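For anyone else hitting this: hive.compute.query.using.stats controls whether Hive answers COUNT(*) from statistics instead of scanning the data, so a session-level workaround (my sketch; I haven't re-verified it against the JIRA) is:

SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM ingest.accounts;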
03-07-2017 02:21 AM
I just tried to reproduce the issue again before trying your suggestion to set hive.enforce.bucketing at the client, and now it is working as expected. I ran this code all day yesterday and never got the expected results. I have a support ticket open with Microsoft, so I will ask them whether there were any issues with the system yesterday.
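For reference, setting it at the client is just a session-level command in Beeline or the Hive CLI:

SET hive.enforce.bucketing=true;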
03-07-2017 01:56 AM
I will take your comments about STRING under advisement. None of those columns ever contains a null value.
03-06-2017 10:49 PM
I'm on Azure HDInsight 3.5.1, which deploys HDP 2.5, using the Spark cluster type on Linux VMs. These are the settings I add to the ARM template when deploying the cluster to enable ACID support:

"hive.support.concurrency": true,
"hive.enforce.bucketing": true,
"hive.exec.dynamic.partition.mode": "nonstrict",
"hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
"hive.compactor.initiator.on": true,
"hive.compactor.worker.threads": 1

I'm not using any SET commands on the client. I run queries through both the Visual Studio Azure SDK and Ambari and have tested with both.
03-06-2017 05:02 PM
The following code works unless you change the bucketed column to Id or AccountType. With either of those columns the job completes without any errors, but some of the rows are silently dropped and never end up in the accounts table. I started out using AccountType, and 1.9 million rows were dropped every time I ran the job (20-30 times). After troubleshooting and eliminating other issues I switched to Id and found that only 32k rows were dropped. Finally I tried bucketing on Created (a timestamp) and all rows were loaded into the accounts table. Looking at the skipped data, I could not see any pattern that would point to a specific bucket file.

DROP TABLE IF EXISTS ingest.txt_accounts;
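-- Staging table for the raw tab-delimited export.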
CREATE TABLE IF NOT EXISTS ingest.txt_accounts (
Id BIGINT
,Created TIMESTAMP
,Modified TIMESTAMP
,Status VARCHAR(50)
,ActivationDate TIMESTAMP
,StatusReason VARCHAR(50)
,ApplicationId BIGINT
,IsFraud INT
,LimitPackage VARCHAR(50)
,AccountType VARCHAR(50)
,IsEmployeeAccount INT
,IsOneTimeFeePaid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH 'adl://datalake.azuredatalakestore.net/LandingZone/Accounts'
OVERWRITE INTO TABLE ingest.txt_accounts;
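-- Transactional target table: Hive ACID on HDP 2.5 requires ORC storage, a bucketed layout, and transactional=true.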
DROP TABLE IF EXISTS ingest.accounts;
CREATE TABLE IF NOT EXISTS ingest.accounts (
Id BIGINT
,Created TIMESTAMP
,Modified TIMESTAMP
,Status VARCHAR(50)
,ActivationDate TIMESTAMP
,StatusReason VARCHAR(50)
,ApplicationId BIGINT
,IsFraud INT
,LimitPackage VARCHAR(50)
,AccountType VARCHAR(50)
,IsEmployeeAccount INT
,IsOneTimeFeePaid INT
)
CLUSTERED BY (Created) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional" = "true");
INSERT INTO TABLE ingest.accounts
SELECT
Id
,Created
,Modified
,Status
,ActivationDate
,StatusReason
,ApplicationId
,IsFraud
,LimitPackage
,AccountType
,IsEmployeeAccount
,IsOneTimeFeePaid
FROM ingest.txt_accounts;
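To make the row loss visible without relying on table statistics, a minimal check is to compute the count on both tables in one query; this assumes the stats workaround from my other reply (hive.compute.query.using.stats=false):

SET hive.compute.query.using.stats=false;
SELECT t.tbl, t.cnt FROM (
  SELECT 'txt_accounts' AS tbl, COUNT(*) AS cnt FROM ingest.txt_accounts
  UNION ALL
  SELECT 'accounts' AS tbl, COUNT(*) AS cnt FROM ingest.accounts
) t;

If the two counts differ, the bucketed insert dropped rows.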