Member since
03-06-2017
5
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
3304 | 03-07-2017 10:06 PM |
03-07-2017
10:06 PM
The table statistics are used to provide the count of rows. I was using SELECT COUNT(*) FROM... to validate my loads which was reporting the wrong number of rows. I went through the TEZ logs and saw that my final loads were reporting the correct number of rows in the counters. Bug: https://issues.apache.org/jira/browse/HIVE-11266
... View more
03-07-2017
02:21 AM
I just tried to reproduce the issue again before trying your suggestion to set hive.enforce.bucketing at the client and it is working as expected. I ran this code all day yesterday and didn't get the expected results once. I have a support ticket in with Microsoft so I will ask them if there were any issues with the system yesterday.
... View more
03-07-2017
01:56 AM
I will take your comments about STRING under advisement. None of the values for those columns is ever null for a row.
... View more
03-06-2017
10:49 PM
I'm on Azure HDInsight 3.5.1 which deploys HDP 2.5. I'm using the Spark cluster on Linux VMs. These are the settings added to the ARM template that I use to deploy the cluster to enable ACID support. "hive.support.concurrency": true, "hive.enforce.bucketing": true, "hive.exec.dynamic.partition.mode": "nonstrict",
"hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager", "hive.compactor.initiator.on": true, "hive.compactor.worker.threads": 1 I'm not using any SET commands on the client. I use Visual Studio Azure SDK and Ambari to run queries and I have tested with both.
... View more
03-06-2017
05:02 PM
The following code works unless you change the bucketed column to Id or AccountType. When using those columns the job processes without any errors but some of the rows are dropped or ignored and do not end up in the accounts table. I started out using AccountType and 1.9MM rows were dropped every time I ran the job (20 - 30 times). I changed to Id after troubleshooting and eliminating other issues and found that only 32k rows were dropped. Finally I tried bucketing on Created (timestamp) and all rows were loaded into accounts table. Looking at the data that was skipped I could not see any pattern that would point to a specific bucket file. DROP TABLE IF EXISTS ingest.txt_accounts;
CREATE TABLE IF NOT EXISTS ingest.txt_accounts (
Id BIGINT
,Created TIMESTAMP
,Modified TIMESTAMP
,Status VARCHAR(50)
,ActivationDate TIMESTAMP
,StatusReason VARCHAR(50)
,ApplicationId BIGINT
,IsFraud INT
,LimitPackage VARCHAR(50)
,AccountType VARCHAR(50)
,IsEmployeeAccount INT
,IsOneTimeFeePaid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH 'adl://datalake.azuredatalakestore.net/LandingZone/Accounts'
OVERWRITE INTO TABLE ingest.txt_accounts;
DROP TABLE IF EXISTS ingest.accounts;
CREATE TABLE IF NOT EXISTS ingest.accounts (
Id BIGINT
,Created TIMESTAMP
,Modified TIMESTAMP
,Status VARCHAR(50)
,ActivationDate TIMESTAMP
,StatusReason VARCHAR(50)
,ApplicationId BIGINT
,IsFraud INT
,LimitPackage VARCHAR(50)
,AccountType VARCHAR(50)
,IsEmployeeAccount INT
,IsOneTimeFeePaid INT
)
CLUSTERED BY (Created) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional" = "true");
INSERT INTO TABLE ingest.accounts
SELECT
Id
,Created
,Modified
,Status
,ActivationDate
,StatusReason
,ApplicationId
,IsFraud
,LimitPackage
,AccountType
,IsEmployeeAccount
,IsOneTimeFeePaid
FROM ingest.txt_accounts;
... View more