Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

select count(*) doesn't match orcfiledump results

Highlighted

select count(*) doesn't match orcfiledump results

I'm looking for a faster way of getting a row count from a partition. select count(*) does a good job, but it takes too long and requires too many resources.

 

I'm trying to use hive --orcfiledump command to get the correct value, finding lines that look like this:

 

Stripe: offset: 3 data: 18297929 rows: 76771 tail: 338 index: 6535

 

and adding up the "rows" values. Unfortunately, the total I'm getting is significantly off. In one case, count(*) gets 309250764 (which matches the rows output from my reducer), but orcfiledump tells me there are 312247462 rows.

 

Can anyone help me understand the difference?

1 REPLY 1

Re: select count(*) doesn't match orcfiledump results

Guru

Hi,

 

I have tested and it looks like working for me. I got the following sample output:

 

 

16/11/19 00:14:38 WARN mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present.  Continuing without it.
Structure for /user/hive/warehouse/orc_test.db/test_orc/000017_0
File Version: 0.12 with HIVE_8732
16/11/19 00:14:41 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse/orc_test.db/test_orc/000017_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 5046226
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>
....


Stripes:
  Stripe: offset: 3 data: 64482559 rows: 4175000 tail: 97 index: 34390
    Stream: column 0 section ROW_INDEX start: 3 length 58
    Stream: column 1 section ROW_INDEX start: 61 length 29726
....

  Stripe: offset: 64517049 data: 20001356 rows: 871226 tail: 91 index: 11031
    Stream: column 0 section ROW_INDEX start: 64517049 length 28
    Stream: column 1 section ROW_INDEX start: 64517077 length 9861
    Stream: column 2 section ROW_INDEX start: 64526938 length 374
....

The information you need to look for is:

Rows: 5046226

 

Not under:

Stripe: offset: 3 data: 64482559 rows: 4175000 tail: 97 index: 34390

 

I have 17 files and after I added all the number under Rows: xxxxxxxx, I got the correct count number.

 

Please try again to see if it works for you.