<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive corrupting or displaying data corruptly in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103377#M33325</link>
    <description>&lt;P&gt;
	According to &lt;A href="https://issues.apache.org/jira/browse/HIVE-5795"&gt;Hive's JIRA&lt;/A&gt; for skipping header and footer rows (see the comments at the bottom), the feature seems to work as expected only for tables represented by a single split (file). For the time being, to avoid trouble it's best to remove headers and footers beforehand and refrain from using skip.header.line.count and skip.footer.line.count.&lt;/P&gt;</description>
    <pubDate>Thu, 30 Jun 2016 08:07:20 GMT</pubDate>
    <dc:creator>pminovic</dc:creator>
    <dc:date>2016-06-30T08:07:20Z</dc:date>
    <item>
      <title>Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103369#M33317</link>
      <description>&lt;P&gt;Hello, &lt;/P&gt;&lt;P&gt;I encountered a weird problem. I pointed an external table to data in HDFS. The source file is a non-compressed, pipe-delimited file of about 5 GB. When I run wc -l /hdfs/fileA.arc, it reports 80,002,783 rows, but when I query select count(*) from tableA, I get 16,877,533.&lt;/P&gt;&lt;P&gt;I examined the file and there are no weird characters, blanks, etc.&lt;/P&gt;&lt;P&gt;Did I do something wrong? Shouldn't the row counts be the same? Does Hive automatically remove duplicates?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2016 17:39:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103369#M33317</guid>
      <dc:creator>jankytara</dc:creator>
      <dc:date>2016-06-29T17:39:15Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103370#M33318</link>
      <description>&lt;P&gt;Can you please share the table DDL from the command below?&lt;/P&gt;&lt;PRE&gt;show create table &amp;lt;tablename&amp;gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 29 Jun 2016 17:46:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103370#M33318</guid>
      <dc:creator>jyadav</dc:creator>
      <dc:date>2016-06-29T17:46:37Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103371#M33319</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/10784/jankytara.html" nodeid="10784"&gt;@Jan Kytara&lt;/A&gt;&lt;P&gt;Can you please share the table definition?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2016 17:46:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103371#M33319</guid>
      <dc:creator>ssubhas</dc:creator>
      <dc:date>2016-06-29T17:46:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103372#M33320</link>
      <description>&lt;PRE&gt;CREATE EXTERNAL TABLE corrupt_rows
(
   A   INT,
   B   BIGINT,
   C   STRING,
   D   STRING,
   E   STRING,
   F   STRING,
   G   DOUBLE,
   H   INT,
   I   DOUBLE,
   J   INT,
   K   STRING,
   L   STRING
)
COMMENT 'xy'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS
   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://xy:8020/data/temp'
TBLPROPERTIES ('COLUMN_STATS_ACCURATE' = 'false',
               'numFiles' = '1',
               'numRows' = '-1',
               'rawDataSize' = '-1',
               'skip.header.line.count' = '1',
               'totalSize' = '4969304654',
               'transient_lastDdlTime' = '1467196659')&lt;/PRE&gt;</description>
      <pubDate>Wed, 29 Jun 2016 18:18:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103372#M33320</guid>
      <dc:creator>jankytara</dc:creator>
      <dc:date>2016-06-29T18:18:17Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103373#M33321</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2528/jyadav.html" nodeid="2528"&gt;@Jitendra Yadav
@Sindhu&lt;/A&gt; &lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2016 19:35:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103373#M33321</guid>
      <dc:creator>jankytara</dc:creator>
      <dc:date>2016-06-29T19:35:41Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103374#M33322</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/10784/jankytara.html" nodeid="10784"&gt;@Jan Kytara&lt;/A&gt;. Can you please update the statistics on the table by running this command: &lt;/P&gt;&lt;PRE&gt;analyze table corrupt_rows compute statistics ; &lt;/PRE&gt;&lt;P&gt;I would also love to know whether "&lt;STRONG&gt;select * from corrupt_rows limit nnn ;&lt;/STRONG&gt;" returns properly formed rows with columns A..L, or whether the rows contain junk or shifted column boundaries. That could point to a delimiter issue.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2016 20:11:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103374#M33322</guid>
      <dc:creator>bpreachuk</dc:creator>
      <dc:date>2016-06-29T20:11:49Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103375#M33323</link>
      <description>&lt;P&gt;Okay, I played around. After removing &lt;/P&gt;&lt;PRE&gt;'skip.header.line.count'='1'&lt;/PRE&gt;&lt;P&gt;and creating a new external table, count(*) = wc -l.&lt;/P&gt;&lt;P&gt;Here are the header and a sample row. I don't see any irregularities; the only difference is that the header lacks 2 columns from the DDL definition (K, L), which should not be a problem: &lt;/P&gt;&lt;PRE&gt; A |B |C |D |E |F |G |H |I |J

+04454.|+133322063.|A42AL|201618|20160702|N|+00000.00|0|+00001.11|0


&lt;/PRE&gt;&lt;P&gt;Out of curiosity, I created a table without columns K and L in order to match the header row. With the option &lt;/P&gt;&lt;PRE&gt;'skip.header.line.count'='1'&lt;/PRE&gt;&lt;P&gt;it gives the wrong result: count(*) &amp;lt;&amp;gt; wc -l. Without it, the result is correct.&lt;/P&gt;&lt;P&gt;What causes this? Can someone test this with a big table? I am running Hadoop 2.7.1.2.4.0.0-169.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jun 2016 20:57:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103375#M33323</guid>
      <dc:creator>jankytara</dc:creator>
      <dc:date>2016-06-29T20:57:52Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103376#M33324</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/10784/jankytara.html" nodeid="10784"&gt;@Jan Kytara&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I tried reproducing this issue on my cluster with your dataset and table, but it looks like it is working fine even with the skip.header.line.count parameter.&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; select count(*) from corrupt_rows;
Query ID = hdfs_20160622192941_a2505b4a-96a7-4148-87ce-a52e92bd75c7
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1466074160497_0010)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================&amp;gt;&amp;gt;] 100%  ELAPSED TIME: 4.89 s
--------------------------------------------------------------------------------
OK
90
Time taken: 5.467 seconds, Fetched: 1 row(s)

&lt;/PRE&gt;&lt;PRE&gt;-bash-4.1$ wc -l data.txt
91 data.txt
-bash-4.1$

&lt;/PRE&gt;&lt;P&gt;Which HDP version are you using?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2016 03:44:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103376#M33324</guid>
      <dc:creator>jyadav</dc:creator>
      <dc:date>2016-06-30T03:44:56Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103377#M33325</link>
      <description>&lt;P&gt;
	According to &lt;A href="https://issues.apache.org/jira/browse/HIVE-5795"&gt;Hive's JIRA&lt;/A&gt; for skipping header and footer rows (see the comments at the bottom), the feature seems to work as expected only for tables represented by a single split (file). For the time being, to avoid trouble it's best to remove headers and footers beforehand and refrain from using skip.header.line.count and skip.footer.line.count.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2016 08:07:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103377#M33325</guid>
      <dc:creator>pminovic</dc:creator>
      <dc:date>2016-06-30T08:07:20Z</dc:date>
    </item>
    <item>
      <title>Re: Hive corrupting or displaying data corruptly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103378#M33326</link>
      <description>&lt;P&gt;Hi, it works fine for me with small tables. It seems to corrupt data only in tables bigger than 1 block.&lt;/P&gt;&lt;P&gt;Hadoop 2.7.1.2.4.0.0-169&lt;/P&gt;</description>
      <pubDate>Mon, 04 Jul 2016 15:59:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-corrupting-or-displaying-data-corruptly/m-p/103378#M33326</guid>
      <dc:creator>jankytara</dc:creator>
      <dc:date>2016-07-04T15:59:20Z</dc:date>
    </item>
  </channel>
</rss>

