Hive corrupting or displaying data incorrectly

Contributor

Hello,

I encountered a weird problem. I pointed an external table at data in HDFS. The source file is uncompressed, pipe-delimited, and about 5 GB. When I run wc -l /hdfs/fileA.arc, it reports 80,002,783 rows, but when I query select count(*) from tableA, I get 16,877,533.

I examined the file and found no weird characters, blank lines, etc.

Did I do something wrong? Shouldn't the row counts be the same? Does Hive automatically remove duplicates?

Thanks

1 ACCEPTED SOLUTION

Master Guru

According to the comments at the bottom of Hive's JIRA for skipping header and footer rows, the feature works as expected only for tables backed by a single split (a single file). For the time being, the safest approach is to remove headers and footers beforehand and to avoid skip.header.line.count and skip.footer.line.count.
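
If it helps, here is a minimal sketch of that workaround (the file and directory paths are hypothetical stand-ins for the ones in this thread; it assumes the raw extract is still available on the local filesystem):

# Strip the header line locally, so skip.header.line.count is no longer needed.
tail -n +2 /local/fileA.arc > /local/fileA.noheader.arc

# Swap the file under the external table's LOCATION for the header-less copy.
hdfs dfs -rm /data/temp/fileA.arc
hdfs dfs -put /local/fileA.noheader.arc /data/temp/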


9 REPLIES

Super Guru

Can you please share the table DDL (the output of the command below)?

show create table <tablename>

Super Guru

Hi @Jan Kytara

I tried reproducing this issue on my cluster with your dataset and table, but it looks like it works fine even with the skip.header.line.count parameter.

hive> select count(*) from corrupt_rows;
Query ID = hdfs_20160622192941_a2505b4a-96a7-4148-87ce-a52e92bd75c7
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1466074160497_0010)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.89 s
--------------------------------------------------------------------------------
OK
90
Time taken: 5.467 seconds, Fetched: 1 row(s)

-bash-4.1$ wc -l data.txt
91 data.txt
-bash-4.1$

Which HDP version are you using?

Contributor

Hi, it works fine for me with small tables. It seems to only corrupt data in tables bigger than one HDFS block.

Hadoop 2.7.1.2.4.0.0-169
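
To confirm how many blocks (and therefore input splits) the file actually spans, something like this should work (the path is a hypothetical stand-in for yours):

hdfs fsck /data/temp/fileA.arc -files -blocks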

@Jan Kytara

Can you please share the table definition?

Contributor
CREATE EXTERNAL TABLE corrupt_rows
(
   A   INT,
   B   BIGINT,
   C   STRING,
   D   STRING,
   E   STRING,
   F   STRING,
   G   DOUBLE,
   H   INT,
   I   DOUBLE,
   J   INT,
   K   STRING,
   L   STRING
)
COMMENT 'xy'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS
   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://xy:8020/data/temp'
TBLPROPERTIES ('COLUMN_STATS_ACCURATE' = 'false',
               'numFiles' = '1',
               'numRows' = '-1',
               'rawDataSize' = '-1',
               'skip.header.line.count' = '1',
               'totalSize' = '4969304654',
               'transient_lastDdlTime' = '1467196659')
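
For reference: with the default 128 MB block size, the totalSize above means this file spans well over one block, so the table is read as many input splits rather than one. A quick ceiling-division check, assuming dfs.blocksize = 134217728:

echo $(( (4969304654 + 134217727) / 134217728 ))   # prints 38, i.e. ~38 blocks/splits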

Contributor

Hi @Jan Kytara. Can you please update statistics on the table - run the command:

analyze table corrupt_rows compute statistics;

Also, I would love to know if "select * from corrupt_rows limit nnn ;" returns properly formed rows with columns A..L, or if it shows junk or misaligned column boundaries. That could point to a delimiter issue.

Contributor

Okay, I played around. After removing

'skip.header.line.count'='1'

and creating a new external table, count(*) = wc -l.

I will include the header and a sample row. I don't find any irregularities; the only difference is that the header lacks 2 columns from the DDL definition (K and L), which should not be a problem:

 A |B |C |D |E |F |G |H |I |J

+04454.|+133322063.|A42AL|201618|20160702|N|+00000.00|0|+00001.11|0


Out of curiosity, I created a table without columns K and L, in order to match the header row. With the option

'skip.header.line.count'='1'

it gives a wrong result (count(*) <> wc -l). Without it, it gives the right result.

What is causing this? Can someone test this out using a big table? I am running Hadoop 2.7.1.2.4.0.0-169. A sketch that might reproduce it is below.
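
A minimal sketch (all names and paths here are made up; it generates a file just big enough to span more than one 128 MB HDFS block):

# Generate a pipe-delimited file with a header row, roughly 200 MB in size.
echo 'A|B' > /tmp/repro.txt
seq 1 20000000 | awk '{print $1 "|x"}' >> /tmp/repro.txt
wc -l /tmp/repro.txt    # expect 20,000,001 lines including the header

# Load it into HDFS and point an external table at it with the skip-header property.
hdfs dfs -mkdir -p /tmp/repro
hdfs dfs -put /tmp/repro.txt /tmp/repro/
hive -e "CREATE EXTERNAL TABLE repro_skip (a INT, b STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tmp/repro'
TBLPROPERTIES ('skip.header.line.count'='1');
SELECT COUNT(*) FROM repro_skip;"
# If the bug is present, the count comes back far below the expected 20,000,000.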
