Created 06-29-2016 10:39 AM
Hello,
I encountered a weird problem. I pointed an external table at data in HDFS. The source file is uncompressed, pipe-delimited, and about 5 GB. When I run wc -l /hdfs/fileA.arc, it reports 80,002,783 rows, but when I query select count(*) from tableA, I get 16,877,533.
I examined the file and there are no weird characters, blanks, etc.
Did I do something wrong? Shouldn't the row counts be the same? Does Hive automatically remove duplicates?
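For reference, the comparison boils down to these two commands (hive -e just runs a query from the shell; running it inside the hive CLI gives the same answer):

wc -l /hdfs/fileA.arc
hive -e 'select count(*) from tableA'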
Thanks
Created 06-29-2016 10:46 AM
Can you please share the table DDL from the below command?
show create table <tablename>
Created 06-29-2016 08:44 PM
Hi @Jan Kytara
I tried reproducing this issue on my cluster with your dataset and table, but it looks like it is working fine even with the skip.header.line.count parameter.
hive> select count(*) from corrupt_rows;
Query ID = hdfs_20160622192941_a2505b4a-96a7-4148-87ce-a52e92bd75c7
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1466074160497_0010)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.89 s
--------------------------------------------------------------------------------
OK
90
Time taken: 5.467 seconds, Fetched: 1 row(s)
-bash-4.1$ wc -l data.txt
91 data.txt
-bash-4.1$
Which HDP version are you using?
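(If you are not sure, hadoop version prints the full build string, and on HDP nodes hdp-select status hadoop-client shows the installed stack version; hdp-select is assumed to be on the path, as it normally is on HDP.)

hadoop version
hdp-select status hadoop-client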
Created 07-04-2016 08:59 AM
Hi, it works fine for me with small tables too. It seems to corrupt counts only for tables whose data is bigger than one HDFS block.
Hadoop 2.7.1.2.4.0.0-169
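You can check how many blocks (and therefore splits) the file spans with fsck; the path here assumes the file sits under the table's LOCATION from the DDL below:

hdfs fsck /data/temp/fileA.arc -files -blocks

With a ~5 GB file and the default 128 MB block size, that is roughly 40 blocks, so Hive reads the table as many splits rather than one.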
Created 06-29-2016 11:18 AM
CREATE EXTERNAL TABLE corrupt_rows (
  A INT,
  B BIGINT,
  C STRING,
  D STRING,
  E STRING,
  F STRING,
  G DOUBLE,
  H INT,
  I DOUBLE,
  J INT,
  K STRING,
  L STRING)
COMMENT 'xy'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://xy:8020/data/temp'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE' = 'false',
  'numFiles' = '1',
  'numRows' = '-1',
  'rawDataSize' = '-1',
  'skip.header.line.count' = '1',
  'totalSize' = '4969304654',
  'transient_lastDdlTime' = '1467196659')
Created 06-29-2016 01:11 PM
Hi @Jan Kytara. Can you please update the statistics on the table by running the command:
analyze table corrupt_rows compute statistics ;
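(Once the analyze finishes, the refreshed numRows should show up under Table Parameters in:)

describe formatted corrupt_rows ;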
Also, I would love to know whether "select * from corrupt_rows limit nnn ;" returns properly formed rows with columns A..L, or whether it contains junk or misaligned column boundaries. That could point to a delimiter issue.
Created 06-29-2016 01:57 PM
Okay, I played around. After removing
'skip.header.line.count'='1'
and creating a new external table, count(*) = wc -l.
I will include the header and a sample row. I don't find any irregularities; the only difference is that the header lacks 2 columns from the DDL definition (K and L), which should not be a problem:
A |B |C |D |E |F |G |H |I |J
+04454.|+133322063.|A42AL|201618|20160702|N|+00000.00|0|+00001.11|0
Out of curiosity I created a table without columns K and L, to match the header row. With the option
'skip.header.line.count'='1'
it gives the wrong result: count(*) <> wc -l. Without it, it gives the right result. A minimal sketch of the comparison is below.
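Roughly what I ran (table names are illustrative; columns and LOCATION are the ones from the DDL above):

-- two external tables over the same data, one with header skipping, one without
CREATE EXTERNAL TABLE with_skip (
  A INT, B BIGINT, C STRING, D STRING, E STRING,
  F STRING, G DOUBLE, H INT, I DOUBLE, J INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'hdfs://xy:8020/data/temp'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE EXTERNAL TABLE without_skip (
  A INT, B BIGINT, C STRING, D STRING, E STRING,
  F STRING, G DOUBLE, H INT, I DOUBLE, J INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'hdfs://xy:8020/data/temp';

-- on the multi-block file, only the second count matches wc -l
SELECT COUNT(*) FROM with_skip;
SELECT COUNT(*) FROM without_skip;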
What causes this? Can someone test this with a big table? I am running Hadoop 2.7.1.2.4.0.0-169.
Created 06-30-2016 01:07 AM
According to Hive's JIRA for skipping header and footer rows (see the comments at the bottom), it works as expected only for tables whose data is read as a single split (file). For the time being, to avoid trouble, it is best to remove headers and footers beforehand and refrain from using skip.header.line.count and skip.footer.line.count.
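A sketch of that workaround, using the file and table names from this thread (tail -n +2 prints everything from the second line onward):

# strip the header locally, then swap the file in HDFS
hdfs dfs -get /data/temp/fileA.arc .
tail -n +2 fileA.arc > fileA_noheader.arc
hdfs dfs -put fileA_noheader.arc /data/temp/
hdfs dfs -rm /data/temp/fileA.arc

# then drop the property so Hive stops skipping a line per split
hive -e "ALTER TABLE corrupt_rows UNSET TBLPROPERTIES ('skip.header.line.count');"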