count(*) in hive and the wc of hadoop file is different.

The values differ when I check count(*) with a Hive query and when I count the number of lines in the corresponding file. select count(*) from table returns 100, whereas wc -l on the corresponding file returns 30. My table is in ORC format. Is this how ORC behaves?

Accepted Solutions
Re: count(*) in hive and the wc of hadoop file is different.

@Bala Vignesh N V

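Yes, the difference in count is expected. ORC is a binary columnar format: it stores the table data in groups of rows called stripes, along with auxiliary information in a file footer, and the column data is encoded and usually compressed rather than written one row per line. The stripe size is configurable and its default depends on the version (250 MB is the figure given in the original ORC documentation, while newer Hive releases default to 64 MB). Since wc -l only counts newline bytes, its result on an ORC file has no relation to the actual number of rows in the table.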
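
To illustrate the point, here is a minimal Python sketch. It does not read or write real ORC files: zlib compression is used only as a stand-in for a binary, compressed layout, and the row contents and the helper name wc_l are made up for the example. It shows that counting newline bytes matches the row count for plain text but not for the binary form of the same data.

```python
import zlib

# 100 hypothetical rows, stored one per line as a plain-text table would be.
rows = [f"{i},name_{i}" for i in range(100)]
text_bytes = ("\n".join(rows) + "\n").encode("utf-8")

# Stand-in for a binary, compressed layout. This is NOT the real ORC
# encoding; it only demonstrates that newline bytes stop being meaningful.
binary_bytes = zlib.compress(text_bytes)


def wc_l(data: bytes) -> int:
    """Count newline bytes, which is what `wc -l` reports."""
    return data.count(b"\n")


text_lines = wc_l(text_bytes)
binary_lines = wc_l(binary_bytes)

# Decoding the binary form recovers the real rows.
recovered_rows = len(zlib.decompress(binary_bytes).decode("utf-8").splitlines())

print("rows written:", len(rows))
print("newline count on text bytes:", text_lines)
print("newline count on compressed bytes:", binary_lines)
print("rows after decoding:", recovered_rows)
```

To check the row count of an actual ORC file without running a query, the hive --orcfiledump utility prints the file metadata, including the number of rows.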

Re: count(*) in hive and the wc of hadoop file is different.

Thanks Sindhu. What happens when we store it as TEXTFILE? The count should remain the same, right?

Re: count(*) in hive and the wc of hadoop file is different.

Yes, it would be the same for an uncompressed TEXTFILE table, since each row is stored as one newline-terminated line.
