question Re: Hive corrupting or displaying data corruptly in Archives of Support Questions (Read Only)

Hive corrupting or displaying data corruptly

jankytara — Wed, 29 Jun 2016 17:39:15 GMT

Hello,

I encountered weird problem. I pointed external table to data in HDFS. Source file have non-compressed pipe delimited about 5gb. When I run wc -l /hdfs/fileA.arc, it results in 80,002,783 rows, but when I query select count(*) from tableA, I get 16,877,533.

I examined the file and there are no weird characters, blanks, etc...

Did I do something wrong? Shouldn't count of rows be the same? Does Hive automatically remove duplicates?

Thanks

Re: Hive corrupting or displaying data corruptly

jyadav — Wed, 29 Jun 2016 17:46:37 GMT

Can you please share the table DDL from below command?

show create table <tablename>

Re: Hive corrupting or displaying data corruptly

ssubhas — Wed, 29 Jun 2016 17:46:43 GMT

@Jan Kytara

Can you please share the table definition?

Re: Hive corrupting or displaying data corruptly

jankytara — Wed, 29 Jun 2016 18:18:17 GMT

CREATE EXTERNAL TABLE corrupt_rows
(
   A   INT,
   B   BIGINT,
   C   STRING,
   D   STRING,
   E   STRING,
   F   STRING,
   G   DOUBLE,
   H   INT,
   I   DOUBLE,
   J   INT,
   K   STRING,
   L   STRING
)
COMMENT 'xy'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS
   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://xy:8020/data/temp'
TBLPROPERTIES ('COLUMN_STATS_ACCURATE' = 'false',
               'numFiles' = '1',
               'numRows' = '-1',
               'rawDataSize' = '-1',
               'skip.header.line.count' = '1',
               'totalSize' = '4969304654',
               'transient_lastDdlTime' = '1467196659')

Re: Hive corrupting or displaying data corruptly

jankytara — Wed, 29 Jun 2016 19:35:41 GMT

@Jitendra Yadav @Sindhu

Re: Hive corrupting or displaying data corruptly

bpreachuk — Wed, 29 Jun 2016 20:11:49 GMT

Hi @Jan Kytara. Can you please update statistics on the table - run the command:

analyze table corrupt_rows compute statistics ;

Also would love to know if "select * from corrupt_rows limit nnn ;" returns properly formed rows with columns A..L, or if it has junk or boundaries. That could point to a delimiter issue.

Re: Hive corrupting or displaying data corruptly

jankytara — Wed, 29 Jun 2016 20:57:52 GMT

Okay. I played around. After removing

'skip.header.line.count'='1'

and creating new external table, then count(*) = wc -l

I will include header and sample row, I don't find any irreguralites, only difference is that header lacks 2 columns from DDL definition (K, L), which should not be a problem:

 A |B |C |D |E |F |G |H |I |J

+04454.|+133322063.|A42AL|201618|20160702|N|+00000.00|0|+00001.11|0

Out of curiosity I created table without columns "K,L" in order to match header row. With option

'skip.header.line.count'='1'

it gives wrong result count(*) <> wc -l. Without, it gives right result.

By what is this caused? Can someone test this out using big table? I am running Hadoop 2.7.1.2.4.0.0-169

Re: Hive corrupting or displaying data corruptly

jyadav — Thu, 30 Jun 2016 03:44:56 GMT

Hi @Jan Kytara

I tried reproducing this issue on my cluster with your dataset and table but looks like it working fine even with skip.header.line.count parameter.

hive> select count(*) from corrupt_rows;
Query ID = hdfs_20160622192941_a2505b4a-96a7-4148-87ce-a52e92bd75c7
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1466074160497_0010)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 4.89 s
--------------------------------------------------------------------------------
OK
90
Time taken: 5.467 seconds, Fetched: 1 row(s)

-bash-4.1$ wc -l data.txt
91 data.txt
-bash-4.1$

Which HDP version you are using?

Re: Hive corrupting or displaying data corruptly

pminovic — Thu, 30 Jun 2016 08:07:20 GMT

According to Hive's JIRA for skipping header and footer rows (see comments at the bottom), it seems it works as expected only for tables represented by a single split (file). For the time being, to avoid troubles it's the best to remove headers and footers beforehand, and refrain from using skip.header.line.count and skip.footer.line.count.

Re: Hive corrupting or displaying data corruptly

jankytara — Mon, 04 Jul 2016 15:59:20 GMT

Hi, It works fine for me with small tables. Seems to only corrupt data in tables bigger than 1 block.

Hadoop 2.7.1.2.4.0.0-169