Support Questions

balavignesh_nag · ‎08-11-2016

The value differs when I check count(*) with a Hive query and when I check the number of lines in the corresponding file. select count(*) from table gives 100 as output, whereas wc -l on the corresponding file gives 30. My table is in ORC format. Is this how ORC behaves?

ssubhas · ‎08-11-2016

@Bala Vignesh N V

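Yes, the difference in count is expected. ORC is a binary columnar format: it stores the table data in groups of rows called stripes, along with auxiliary information in a file footer. The default stripe size is 250 MB according to the Hive ORC documentation, though the configured default varies by Hive version. Because rows are not stored as newline-terminated lines, wc -l only counts the newline bytes that happen to occur in the binary data, so its result will differ from the actual number of rows in the table.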
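To illustrate why a line count and a row count can disagree, here is a minimal Python sketch. It is not the real ORC encoding: the "binary" layout below is a toy stand-in (fixed-width packed integers) and the 100 rows are hypothetical. It only shows that counting newline bytes, which is what wc -l does, matches the row count for a line-per-row text layout but not for a binary layout.

```python
import struct


def count_newlines(data: bytes) -> int:
    """Mimic `wc -l`: count newline bytes (0x0A) in a byte string."""
    return data.count(b"\n")


# 100 hypothetical rows, each holding a single integer id.
rows = list(range(100))

# Text layout (similar in spirit to a TEXTFILE table):
# one row per newline-terminated line.
text_bytes = "".join(f"{r}\n" for r in rows).encode("ascii")

# Binary layout (toy stand-in, NOT the actual ORC format):
# each row packed as a 4-byte little-endian integer, no row delimiter.
binary_bytes = b"".join(struct.pack("<i", r) for r in rows)

text_lines = count_newlines(text_bytes)
binary_lines = count_newlines(binary_bytes)

print("rows:", len(rows))
print("newlines in text layout:", text_lines)
print("newlines in binary layout:", binary_lines)
```

In the text layout the newline count equals the number of rows. In the binary layout a newline byte appears only where a value happens to contain the byte 0x0A, so the count says nothing about how many rows are stored. For a real ORC table, count(*) in Hive remains the way to get the row count.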

balavignesh_nag · ‎08-11-2016

Thanks Sindhu. What happens when we store it as TEXTFILE? The count should remain the same, right?

ssubhas · ‎08-11-2016

Yes, it would be the same, as long as each row occupies exactly one line in the file.
